A scientists' view of scientometrics: Not everything that counts can be counted

Like it or not, attempts to evaluate and monitor the quality of academic research have become increasingly prevalent worldwide. Performance reviews range from the level of individuals, through research groups and departments, to entire universities. Many of these are informed by, or are functions of, simple scientometric indicators, and the results of such exercises impact on careers, funding and prestige. However, there is sometimes a failure to appreciate that scientometrics are, at best, very blunt instruments and that their incorrect usage can be misleading. Rather than accepting the rise and fall of individuals and institutions on the basis of such imprecise measures, calls have been made for indicators to be regularly scrutinised and for improvements to the evidence base in this area. It is thus incumbent upon the scientific community, especially the physics, complexity-science and scientometrics communities, to scrutinise metric indicators. Here, we review recent attempts to do this and show that some metrics in widespread use cannot be used as reliable indicators of research quality.


Introduction
The field of scientometrics can be traced back to the work of the physicist Derek de Solla Price [2] and the linguist/businessman Eugene Garfield [3]. It is the quantitative study of the impact of science, technology, and innovation [4]. This frequently involves analyses of citations and facilitates (indeed, encourages) the evaluation and ranking of individual scientists, research groups, universities and journals. The closely related (sub-)field of bibliometrics is concerned with measuring the impact of scholarly publications. Perhaps the most famous indicator of the productivity and impact of a scientist is the so-called h-index (named after its creator Jorge E. Hirsch) [5] and the most famous indicator of journal quality is the impact factor, devised by Garfield and Irving Sher [6,7].
The UK is at the forefront of group or departmental research evaluation and has been for a number of decades. The first Research Assessment Exercise (RAE) was carried out in 1986, introducing an explicit, formalised assessment process of research quality. The RAE was adapted and developed to a more

The RAE and REF; why size matters; the h-index and the NCI
For the RAE in 2008, three aspects of group or departmental quality were considered: research outputs, research environment and research esteem. The first of these mostly entailed publications, but for some disciplines software, patents, artefacts, performances or exhibitions were also considered. Research environment was also quantified at RAE2008 and institutions were asked to provide information on funding, infrastructure, vitality, leadership, training, accommodation and so on. The third component for RAE2008 was esteem and indicators included prizes, honours, professional services and other activities. The precise manner in which outputs, environment and esteem fed into the overall final RAE score was dependent upon discipline; in pure and applied mathematics, statistics and the computer sciences, they were weighted at 70%, 20% and 10%, respectively, while in biology and other subjects they were weighted at 75%, 20% and 5%, respectively. RAE2008 estimated the research quality of each submitted research unit in a number of academic disciplines. These estimates were presented as profiles, detailing the proportions of research activity carried out at each of five levels: 4* (world-leading research); 3* (internationally excellent research); 2* (research that is internationally recognised); 1* (research recognised at a national level) and unclassified research. Following the exercise, a formula was used to determine how funding is allocated to higher education institutes for the subsequent years. The formula used by the Higher Education Funding Council for England, immediately after RAE2008, valued 4* and 3* research seven and three times, respectively, more than 2* research and allocated no funding to 1* and unclassified research. We use that formula to condense the research profile of a unit into a scalar as follows: if $p_{n*}$ represents the percentage of a team's research which was rated n*, then a proxy for the team's quality is

\[ s = p_{4*} + \frac{3}{7}\, p_{3*} + \frac{1}{7}\, p_{2*}. \qquad (2.1) \]

The research "strength" of a unit (see footnote 1) is then given by S = sN, where N is the size of the submitted team. The amount of money flowing into the university from the Higher Education Funding Council for England was then a function of S. When research quality s is plotted against N, an interesting pattern emerges for many subject areas.
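As an illustration of formula (2.1) and of S = sN, the following minimal sketch computes both quantities for a single submission. The function names and the example profile are hypothetical and chosen only for illustration; they are not data from RAE2008.

```python
def quality(profile):
    """Quality proxy s of eq. (2.1): s = p4* + (3/7) p3* + (1/7) p2*.

    `profile` maps each rating to the percentage of research at that level.
    1* and unclassified research carry zero weight, as in the funding formula.
    """
    return profile["4*"] + (3.0 * profile["3*"] + profile["2*"]) / 7.0


def strength(profile, n_staff):
    """Research strength S = s * N for a team of N submitted staff."""
    return quality(profile) * n_staff


# Hypothetical profile (percentages sum to 100); not taken from any real unit.
example_profile = {"4*": 20.0, "3*": 35.0, "2*": 35.0, "1*": 10.0, "unclassified": 0.0}

s = quality(example_profile)        # 20 + (3*35 + 35)/7 = 40
S = strength(example_profile, 25)   # 40 * 25 = 1000
print(s, S)
```

Because s is a per-head quantity while S scales with N, two units with identical profiles but different sizes share the same s and differ in S.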
It was shown in [17,18] that quality increases linearly with size up to a certain point, identified as the discipline-dependent point at which research groups tend to become unwieldy and may start to fragment. This is similar to the Ringelmann effect [23] in social psychology and is marked by a Dunbar number [24] which is discipline dependent. A statistical-physics-inspired, mean-field-type theory exposes the existence of a second important group size which may be identified as the critical mass and is also dependent on discipline [17]. For theoretical physics, for example, the critical mass is 6.5 and the Dunbar number is 13. For experimental physics, the corresponding numbers are 13 and 25, respectively. With the critical mass and Dunbar numbers to hand, one may classify groups according to their size. Small groups are below critical mass; medium ones are bigger than critical mass but smaller than the Dunbar number; and groups of still more members are classified as large. Thus, a group of 15 theoretical physicists would be deemed large, for example, while the same number of experimentalists would be considered as medium in size. We will shortly see that size matters when comparing metric indicators to peer-review estimates of research quality.
For REF2014, a number of changes were introduced vis-à-vis RAE2008. Firstly, the esteem category was replaced by impact. The latter, not to be confused with academic or citation impact, was defined as "an effect on, change or benefit to the economy, society, culture, public policy or services, health, the environment or quality of life, beyond academia" [25]. The three categories outputs, environment and impact were then weighted at 65%, 15% and 20%, respectively. Another change was that, while for the RAE research was categorised into 67 academic disciplines, in the 2014 REF there were only 36 units of assessment. The Applied Mathematics Unit of Assessment (which included some theoretical physics groups), for example, was a category at RAE2008, but for REF2014 it was merged with Pure Mathematics, Statistics and Operational Research. One may argue, therefore, that RAE2008 was more "fine-grained" than REF2014.
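The size classification described above (small, medium, large relative to the critical mass and the Dunbar number) can be sketched as follows. The two threshold pairs are those quoted in the text; the function name and the treatment of groups sitting exactly on a threshold are assumptions made for this sketch.

```python
# (critical mass, Dunbar number) as quoted in the text for two disciplines.
THRESHOLDS = {
    "theoretical physics": (6.5, 13.0),
    "experimental physics": (13.0, 25.0),
}


def classify_group(discipline, n_staff):
    """Label a group as 'small', 'medium' or 'large' by its head count.

    Assumption: a group exactly at the critical mass counts as medium and
    a group exactly at the Dunbar number counts as large; the text does not
    specify these boundary cases.
    """
    critical_mass, dunbar = THRESHOLDS[discipline]
    if n_staff < critical_mass:
        return "small"
    if n_staff < dunbar:
        return "medium"
    return "large"


# The example from the text: 15 theorists versus 15 experimentalists.
print(classify_group("theoretical physics", 15))   # large
print(classify_group("experimental physics", 15))  # medium
```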
The next REF is expected to take place in 2021. The rules have not yet been decided, but it is expected that it will build upon REF2014 although there will be incremental changes. The precise role of metrics at REF2021 is yet to be decided but indications are that peer review should remain the primary method of research assessment. Our analysis strongly supports this direction, not only for the UK but for all national exercises of this type.
The question we wish to address is whether or not metrics such as the h-index or NCI should be used for exercises such as the REF. The h-index seeks to measure the citation impact of a researcher along with the volume of their productivity. It is defined as the largest number h such that the author has produced h papers that have each been cited h times or more. Its scalar simplicity renders it very attractive to policy makers and managers. Although originally introduced as a measure at an individual level, the h-index can also be applied to estimate the productivity and scholarly impact of journals, research groups, departments or universities [10,26].
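For concreteness, the h-index can be computed from a list of citation counts as in the minimal sketch below, which takes the standard reading of the definition: the largest h such that h papers have at least h citations each. The function name and the citation counts are hypothetical.

```python
def h_index(citations):
    """Return the largest h such that h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)  # most-cited paper first
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank   # the top `rank` papers all have >= rank citations
        else:
            break      # counts only decrease from here on
    return h


# Hypothetical citation record: four papers have at least 4 citations each,
# but there are not five papers with at least 5.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

The same function applies unchanged to a journal, group or department once the relevant set of papers has been pooled into a single list.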
Thomson Reuters Research Analytics has developed the so-called normalised citation impact (NCI) as another measure of a department's citation performance in a given discipline [15,16]. A useful feature is that it attempts to take account of differences in citation rates across different disciplines by "rebasing" the total citation count for each paper to an average number of citations per paper for the year of publication and either the field or journal in which the paper was published. The measure is determined for an entire group or department and then normalised by the group size. It is, therefore, a specific (per-head) measure (also called intensive in the parlance of statistical physics). Scaled up to the size of a group or department, it gives the corresponding absolute measure (extensive in statistical-physics terminology). Here, we denote the specific NCI by i and its absolute counterpart by I, where I = iN.

Footnote 1. Research strength as defined above can be compared to the so-called "research power", which is a measure that has recently gained in popularity in the UK. Research power is the simple grade point average of a submission ($= 4p_{4*} + 3p_{3*} + 2p_{2*} + p_{1*}$) multiplied by N. Other measures are possible; following lobbying by pressure groups, the funding formula (2.1) changed a number of times to concentrate more money into those groups with the highest quality profiles. Here, we stick with formula (2.1) as it has the advantage of clearly demarcating four quality levels prior to political influence. We have checked that small changes in the formula do not deliver changes to the outcomes of our analysis.

In section 4, we report on a quantitative comparison of both of these indicators against expert peer review measures of the quality of research groups coming from RAE/REF after a brief qualitative discussion in section 3.

Should metrics be used in the research evaluation schemes?
The debate as to whether metrics should or should not be used in national evaluation frameworks is a long one within the academic, scientometric, university-management and policy-making communities internationally. Although flawed in many ways, systems based purely upon peer review enjoy the highest confidence of the scientific community itself [9].
Flaws include the absence of trusted methods to account for different levels of expertise, stringency and bias amongst assessors and the absence of an acceptable way to normalise results across different disciplines. (A new approach to overcome some such difficulties has recently been developed [27].) Another objection is that peer-review-based exercises such as the RAE and REF are also expensive [28]. It has been estimated that the total cost to the UK of running REF2014 was £246M. That amount comprises £14M in costs for the UK higher-education funding bodies which run the exercise and £232M in costs to the higher education community itself. The latter figure includes £19M for the panellists' time and £212M for preparing the REF submissions (about £4K for each of the 52 077 researchers submitted). Costs are, therefore, a prime reason put forward by advocates of replacing peer-review exercises with automated systems based on metrics. Another is the burden in terms of time taken away from research activity in order to prepare REF submissions (see footnote 2).
Metrics were not officially used in the earlier RAE or in the REF, although there was nothing to prevent individual assessors from determining the citation counts of individual papers or looking up the citation records of individual researchers. For the 2014 exercise, REF panels were allowed to use citation data, but only to inform their judgements (e.g., to decide how academically significant a paper was), which were predominantly based on peer review (assessors were advised to recognise the significance of outputs beyond academia as well). To this end, citation data were sourced centrally by the REF team using the Scopus database. Assessors were, however, instructed not to refer to additional bibliometric data, such as impact factors or other journal-level metrics, in their deliberations.
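As a back-of-the-envelope check on the REF2014 cost figures quoted above, using only the numbers given in the text, the total cost and the preparation cost per submitted researcher work out as

\[
\pounds 14\,\mathrm{M} + \pounds 232\,\mathrm{M} = \pounds 246\,\mathrm{M},
\qquad
\frac{\pounds 212\,\mathrm{M}}{52\,077} \approx \pounds 4\,071 \approx \pounds 4\,\mathrm{K}.
\]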
In comparison, in France, prior to the creation of the AERES (Agence d'Évaluation de la Recherche et de l'Enseignement Supérieur) in 2006, research assessment was essentially performed by the CNRS (Centre National de la Recherche Scientifique) solely on the basis of evaluation by peers. Panels of peers were formed into assessment committees, which visited the laboratories to which they were assigned. In a given discipline, there were many different panels of experts of this kind. This format of a committee of peers visiting the laboratories remained with the AERES and then, since 2013, with its successor, the HCERES (Haut Conseil de l'Évaluation de la Recherche et de l'Enseignement Supérieur). The novelty introduced with the AERES was the scale of the evaluation campaigns and the field of expertise, with evaluation being performed at the scale of research teams as well as at the scale of universities, and even of the CNRS itself. The use of bibliometrics was progressively introduced into the reports and as a guiding element for the evaluation. AERES had even marked the laboratories and universities according to a rating system of A+, A, B and C, similar to that of the British system. This rating system has now been abandoned.
The Australian Research Council used Scopus as the citation and bibliometrics provider for the Excellence in Research for Australia (ERA) schemes both in 2010 and 2012. Italy's Research Evaluation Exercise will use "informed" peer review. This means that, in areas such as the mathematical, natural, engineering and life sciences, peer evaluation will be supported by bibliometric information from the Web of Science and Scopus citation databases. Evaluators will use both information about the impact of individual articles (through numbers of citations) and the quality of the journals in which they are published (through the Impact Factor and other indicators). In humanities and the social sciences, however, the system uses peer evaluation only.

Footnote 2. We are reminded of the short story The Mark Gable Foundation by Leo Szilard, in which advice to retard scientific progress is: "Take the most active scientists out of the laboratory and make them members of . . . committees. And the very best . . . should be appointed as Chairmen". In this way "the best scientists would be removed from their laboratories and kept busy on committees passing on applications for funds. Secondly, the scientific workers in need of funds will concentrate on problems which are considered promising and are pretty certain to lead to publishable results. . . . By going after the obvious, pretty soon Science will dry out. Science will become something like a parlor game. Some things will be considered interesting, others will not. There will be fashions. Those who follow the fashion will get grants. Those who won't, will not, and pretty soon they will learn to follow the fashion too" [29].
There is no single procedure to assess research institutions in Ukraine. Regular evaluations are mostly based on formal reports, supported by scientometrics. However, their use is often rather haphazard. Dangerously attractive, simple metrics are sometimes used without clear understanding of their peculiarities.
Many scientists and other academics object to the misuse of the scientometric quantification of their research. A fundamental objection is that the metrics are doomed to fail if their intended task is to aid management and funding of science by making it systematic and objective. In 1977, Garfield himself cautioned against the misuse of citation analyses [14]. In the forty years since, however, those words appear to have fallen on deaf ears as citation-misuse is rife [9]. In response, the San Francisco Declaration on Research Assessment [30] was initiated by a group of experts, editors and publishers to call for improvements in the ways in which scientific research is evaluated. Similarly motivated by the fact that research evaluation "is increasingly driven by data and not by expert judgement", the Leiden Manifesto for Research Metrics has been drawn up, comprising ten principles for the measurement of research performance [31]. In the UK, the "Metric Tide" steering group [9] felt it necessary to set up a website as a forum for ongoing discussion of these issues; to "celebrate and encourage responsible uses of metrics" but also to "name and shame bad practices when they occur". (Every year they plan to award a "Bad Metric" prize to the most inappropriate use of quantitative indicators [32].) It is claimed that the misuse of such metrics is changing the nature of science; they are damaging curiosity-driven research as scientists are forced to maximise their personal metrics instead. In a system which excessively rewards novel findings over confirmatory studies, the most rational research strategy is for scientists to spend most of their effort seeking novel results through small studies with low statistical power [33]. As a result, half of the studies they publish would contain erroneous conclusions. 
The existence of a "trade-off" between productivity and rigour was also claimed in [34]: poor methods result from incentives that favour them, and one of these is the priority of publication over discovery for career advancement. These are examples of Goodhart's law: when quantitative metrics are introduced as proxies to reward academics, these metrics become targets and cease to be good measures [35,36].
Notwithstanding these objections, we next ask whether or not metrics are capable of approximating the results of RAE or REF. Again, our task is motivated by the widespread view of peer review as a "gold standard" [9] and the desire of some to replace or inform it. We shall find that at least the NCI and h-index are not capable of approximating RAE/REF. This suggests that the UK should persist with its peer-review-based REF-type evaluation system and that other countries should also seek to move in this direction and away from metrics-driven exercises.

The NCI
In [19,20], NCI values were compared with RAE2008 measures of research quality and strength for various groups in various disciplines, from the natural to social sciences and humanities. The results are reproduced here in table 1. (We refer the reader to the original literature for tests of significance [19,20].) Actually, only the outputs component of the RAE results is used in the determination of the correlation coefficients in this instance, i.e., neither the environment nor the esteem measures are used here. This is because only outputs contribute directly to the NCI. We label the corresponding quality and strength measures by $s_1$ and $S_1$, respectively. The table lists the values of the Pearson correlation coefficient for extensive (absolute) quantities (namely $I$ vs. $S_1$) in boldface and those for intensive (specific) quantities (namely $i$ vs. $s_1$) in regular typeface. Figure 1 gives examples of $i$ vs. $s_1$ and $I$ vs. $S_1$ plots for the case of chemistry research groups.
One observes that the intensive measure $i = I/N$ is poorly correlated with group quality $s_1 = S_1/N$ for all disciplines and for all group sizes. One of the best correlations between $i$- and $s_1$-values is in the case of chemistry, but even then the Pearson correlation coefficient is only 0.6. Since NCI and RAE scores are also used to rank research groups, we also evaluated the Spearman correlation coefficient between [...]

Table 1. Correlation coefficients between absolute values $I$ and $S_1$ (boldface) and specific values $i$ and $s_1$ (regular typeface) calculated for several different disciplines. Here, the subscript 1 indicates that only the "outputs" sector of the RAE results is used. Pearson's correlation coefficient r is presented for all groups in a given discipline and separately for large groups and small/medium groups. We also present Spearman's coefficient for ranked values for all groups.

[...] axis. On closer examination, however, these correlations are the best for large groups in these disciplines; for small/medium groups, they fall below 0.90. The correlations are also worse for other disciplines, even for their large groups; e.g., for sociology and history they were 0.88. This outcome suggests the almost paradoxical result that the NCI could possibly form a basis for deciding on funding amounts for research institutions, but only for the sciences and only for large research groups. It should not be used in any other cases (not for the social sciences or humanities, and not even for science groups with sub-Dunbar numbers of staff). And certainly, it should never be used as a basis for ranking or comparing research groups. Further details are given in [19,20].
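To make the distinction between intensive and extensive correlations concrete, the sketch below implements the Pearson and Spearman coefficients in plain Python and applies them to a toy data set. All numbers are invented for illustration and are not taken from table 1 or from [19,20]; they merely show how multiplying both quantities by the group size N can yield a high correlation between absolute values even when the per-head values disagree.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy data (hypothetical): group sizes, per-head NCI i and quality s_1.
N = [5, 10, 20, 40]
i_vals = [1.2, 0.8, 1.1, 0.9]
s_vals = [30.0, 40.0, 35.0, 45.0]

# Extensive counterparts I = i N and S_1 = s_1 N.
I_vals = [i * n for i, n in zip(i_vals, N)]
S_vals = [s * n for s, n in zip(s_vals, N)]

r_intensive = pearson(i_vals, s_vals)   # per-head quantities
r_extensive = pearson(I_vals, S_vals)   # absolute quantities
rho_intensive = spearman(i_vals, s_vals)
print(r_intensive, r_extensive, rho_intensive)
```

In this toy example the per-head correlation is negative while the absolute one exceeds 0.9, simply because both I and S_1 grow with N.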

The departmental h-index
At this stage, we have established that the NCI is not a good specific (intensive) indicator for research quality. Is there a better metric, perhaps? In [21], we demonstrated that the departmental h-index [10,26] indeed has a better correlation with the RAE-measured quality index s than has the NCI, i. A departmental h-index of n, say, indicates that n papers authored by researchers from a given department in a given discipline were cited at least n times over a given time period. The departmental h-index uses data from all researchers from a given department, not only those submitted to RAE or REF. However, in practice, individuals with weaker citation records are swamped by those with stronger ones, so that it can be dominated by a few individuals, even by a single, extremely strong one.
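The swamping effect described above can be seen in a small sketch that pools the papers of a hypothetical three-person department. The names and citation counts are invented, and for simplicity the sketch assumes that no paper is co-authored within the department, so pooling is a plain concatenation of the individual lists.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    # With counts in decreasing order, `cites >= rank` holds exactly for
    # ranks 1..h, so counting the ranks that satisfy it gives h.
    return sum(1 for rank, cites in enumerate(ranked, start=1) if cites >= rank)


def departmental_h_index(papers_by_author):
    """h-index of the pooled paper list (assumes no shared papers)."""
    pooled = [c for papers in papers_by_author.values() for c in papers]
    return h_index(pooled)


# Hypothetical department: one very strong researcher and two weaker ones.
department = {
    "A": [120, 95, 80, 60, 44, 30, 25, 12],
    "B": [6, 4, 2],
    "C": [5, 3, 1, 0],
}

h_dept = departmental_h_index(department)
h_strongest = h_index(department["A"])
h_without_a = departmental_h_index({k: v for k, v in department.items() if k != "A"})
print(h_dept, h_strongest, h_without_a)  # 8 8 3
```

Here the departmental value coincides with that of researcher A alone; the papers of B and C do not affect it.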
We determined departmental h-indices for universities which submitted to RAE2008 within the disciplines of biology, chemistry, physics and sociology. The citation data we used were taken from the Scopus database. In order to estimate h, we filtered the Scopus data to extract only those publications which correspond to the UK and which were published in the period 2001-2007, so as to compare with RAE2008. We selected subjects most closely corresponding to the above four disciplines using Scopus subject categories (see [21] for details). Unlike for the RAE and REF, where authors' affiliations are determined by their addresses at the assessment census date, the author address at the time of publication determines to which university a given output is allocated for the departmental h-index. A small number of institutions were not listed in the Scopus database after refining the search result, so it was not possible to determine h-indices in these cases. This means that the set of universities contributing to table 2 differs slightly from that contributing to table 1. The results we present in table 2 are for institutions that could be accessed via Scopus. The table shows that the departmental h-index is indeed better correlated with overall RAE-measured research quality. However, the correlations between the h-index and the RAE results are still too small to replace the peer-review exercise by metrics. (We again refer the reader to the original literature for significance tests [19,20].)

(Mis-)Predicting REF
We have seen that neither the NCI nor the departmental h-index has a good enough correlation with RAE results to contemplate replacing the peer-review exercise by metrics. To demonstrate that forcefully, we decided to use the departmental h-index to predict outcomes of REF2014. The idea was that if simple citation-based metrics are ever to be used as some sort of proxy for peer review, one would expect them to be able to predict at least some aspects of the outcomes of such exercises. Even a limited success might suggest that a citation metric could serve at least as a "navigator", to help guide research institutes as they prepare for the expert exercises. For example, research managers may be interested in whether metrics could indicate whether or not they are likely to move up or down the REF league tables in various subject disciplines. We placed our predictions for the rankings in biology, chemistry, physics and sociology on the arXiv in November 2014, before the REF results were officially announced. These were subsequently published as [21]. After the REF results were announced in December 2014, we revisited our study [22]. The correlations between the REF results and the h-index predictions are given in table 3. We also tried to anticipate whether individual institutions would move up or down in the rankings between RAE2008 and REF2014. The results for the correlations between our predictions and the actual results are also listed in table 3. For example, for physics, chemistry and biology, they were 0.26, 0.05 and −0.15, respectively. If we restrict ourselves to outputs only, the correlations are even worse at 0.02, 0.20 and −0.33. As commented in the press later, one would find better estimates of movement in the league tables by tossing dice! Our results are published as [22,37].

Discussion
The vast majority of academics are opposed to the increased use of automated metrics to monitor research activity. A concern is that, because "inappropriate indicators create perverse incentives" [9], inexpert use of such metrics to simplify the bases for important judgments, decisions and league tables surely leads to violations of the age-old and treasured principle of academic freedom because researchers are forced to chase citation-based metrics rather than allowed to follow where their curiosity leads. Here, we have reported on a series of publications that show that these fears are well founded; metrics are a poor measure of research quality. Our advice to those in authority who are attracted to such simple measures is: do not be fooled by their quantitative nature; they are crude at best, and their misuse can damage academic research.
In recent years, the UK has commissioned at least two major reports on the matter of metrics and research evaluation. The first of these was the Wilsdon Report, titled The Metric Tide: Report of the Independent Review of the Role of Metrics in Research Assessment and Management [9], which concluded that peer review should remain the primary method of research assessment. These findings were endorsed by the more recent Stern Review [38].
As pointed out in [9], "There are powerful currents whipping up the metric tide". However, "Across the research community, the description, production and consumption of "metrics" remains contested and open to misunderstandings". The tenth Principle of the Leiden Manifesto calls for scrutiny of indicators [31] and the San Francisco Declaration urges "a pressing need to improve the ways in which the output of scientific research is evaluated" [30]. For these and other reasons, "There is a need for more research on research" [9]. To respond to such calls, it is important that scientists turn their tools to their own discipline too. Indeed, amongst other academic evidence, the Metric Tide report [9] made considerable use of [20-22], which, in turn, built upon [19,20], which themselves were inspired by statistical-physics-inspired mean-field theories [17,18]. In this paper, we have tried to contextualise such "research on research" within the UK national context while highlighting its implications internationally too.
Moreover, and as pointed out in [28], the £4 000 that it cost to submit and evaluate each researcher to REF amounts to only 1% of what it costs to employ them over a six-year period. Viewed in this way, REF is actually a rather cheap exercise. We suggest that, if other countries insist on monitoring their academic researchers, it would be prudent for them to also move towards peer-review-based exercises and away from metrics.