A mechanism for evolution of the physical concepts network

We suggest an underlying mechanism that governs the growth of a network of concepts, a complex network that reflects the connections between different scientific concepts based on their co-occurrences in publications. To this end, we perform an empirical analysis of a network of concepts based on the preprints in physics submitted to arXiv.org. We calculate the network characteristics and show that they cannot be reproduced by several simple, commonly used network growth models. In turn, we suggest that a simultaneous account of two factors, i.e., growth by blocks and preferential selection, explains the empirically observed properties of the concepts network. Moreover, the observed structure emerges as a synergistic effect of both of these factors: neither of them alone leads to a satisfactory picture.

Networks of concepts, i.e., semantic networks that reflect the relations between concepts in a certain domain, are ubiquitous in different spheres of modern life [1]. Their importance is due both to fundamental reasons and to numerous applications, ranging from ontologies in computer and information science [2] to visual knowledge maps that serve as an aid showing where to look for certain knowledge [3]. Such networks are of particular interest for logology, the 'science of science', which aims at a quantitative understanding of the origins of scientific discovery and creativity, its structure and practice [4,5]. Scientific papers are an ideal source to investigate such processes, providing validated and open results of scientific creativity that are recorded in text formats and supplied with numerous supporting pieces of information. A common approach to the quantitative description of the knowledge structure is via the analysis of its projections to semantic spaces for different domains, see e.g., [6] and references therein. The latter can be modelled as complex networks based on topic-indicating labels. To give a few examples, one can mention here the networks of papers in physics that co-used PACS numbers [7,8], biomedical papers that co-mentioned the same chemical entities [9], papers in cognitive neuroscience [10] and in quantum physics [6] with co-occurrence of predefined concepts, Wikipedia pages devoted to mathematical theorems [11], etc. In all the above cases, the complex network formalism enables a quantitative analysis of similarities between different entities, which are typically considered as indicators of topical relatedness and, therefore, as projections of knowledge.
Besides, the networks discussed above arise as an outcome of a dynamical process in which new knowledge is acquired. Innovations themselves can be interpreted as an emergence of new concepts or new relations between the existing ones [12-14]. Modelling such a process is a challenging task both for its fundamental relevance and numerous practical implementations. The process of scientific discovery itself is governed by the structure of scientific knowledge. At the same time, it leads to changes in this structure: in other words, they dynamically update each other. The presence of such a co-evolution is a typical feature of any complex system [15,16] and is reflected, in particular, in the growth dynamics of the underlying complex networks of terms, keywords, labels or tags that become co-chosen from some predefined semantic space. Modelling such complex networks, along with their empirical analysis, is a challenging task that provides a deeper understanding of their growth mechanisms [12,17,18].
Table 1. Some features of networks of concepts addressed in our study. An empirically observed network (first line) is compared with three different models discussed in the paper: Erdős-Rényi, Barabási-Albert, and growth by blocks with preferential selection.
In this Letter, we suggest an underlying mechanism that governs the growth of a network of concepts originating from the texts of preprints in physics submitted to the e-print repository arXiv [19]. First, we perform an empirical analysis of this network and calculate its topological characteristics. We then discuss the main network features and show that a simultaneous account of two factors, i.e., growth by blocks and preferential selection, explains the empirically observed properties. A detailed account of our analysis is to be published elsewhere [20].
We used the vocabulary of scientific concepts in the domain of physics that has been collected by the ScienceWISE.info platform [21] and refined by continuous updates through expert evaluations. The resulting ontology includes such concepts as Ferromagnetism, Quantum Hall Effect, Renormalization group, and thousands of others. To our knowledge, this is currently the most comprehensive vocabulary of this type in the domain of physics. The sample of articles we analysed consists of 36,386 items submitted to arXiv during the single year 2013 that were assigned to a single category during the submission process; it is in one-to-one correspondence with the data set analysed in [14,22,23]. For each of the articles, a set of its inherent concepts has been defined using the above-mentioned vocabulary of concepts. In this way, we arrived at data that are conveniently described as a bipartite network consisting of nodes of two types: articles $A_1, A_2, \dots, A_{\mathcal{N}}$ and concepts $C_1, C_2, \dots, C_N$; each $A$-node is linked to those $C$-nodes that represent its inherent concepts. While the properties of the bipartite network and its one-mode projection into the space of articles were analysed in [22,23], here we concentrate on its one-mode projection into the space of concepts. In this projection, all $C$-nodes that were connected to the same $A$-node enter the network as a complete graph, or clique. Hereafter, such a one-mode projection is called a network of concepts and is the subject of empirical analysis and modelling.
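The one-mode projection described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the actual data set: the function name `project_to_concepts`, the article identifiers and the toy concept lists (including the label "Ising model") are assumptions made for the example only.

```python
from collections import defaultdict
from itertools import combinations


def project_to_concepts(article_concepts):
    """Project an article-concept bipartite network onto the concepts.

    article_concepts: dict mapping an article id to an iterable of concept labels.
    Returns a dict mapping each concept to the set of concepts it co-occurs with.
    """
    adjacency = defaultdict(set)
    for concepts in article_concepts.values():
        unique = set(concepts)
        for concept in unique:
            adjacency[concept]  # register the node even if it has no neighbours
        # Every pair of concepts of the same article is linked: the article
        # enters the network of concepts as a clique.
        for u, v in combinations(sorted(unique), 2):
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)


# Hypothetical toy data: two articles sharing one concept.
toy_articles = {
    "a1": ["Ferromagnetism", "Renormalization group", "Ising model"],
    "a2": ["Quantum Hall Effect", "Renormalization group"],
}
network = project_to_concepts(toy_articles)
```

In this toy example the shared concept becomes a neighbour of all the others, while concepts that never appear in the same article stay unlinked.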
The main characteristics of the network of concepts constructed from the data described in the preceding paragraph are given in the first line (denoted as 'empirical') of table 1. There, out of many network indicators, we display those that describe the most typical features addressed below. In particular, the empirically observed network of concepts is very dense: the density of links is $\rho = 2L/[N(N-1)] = 7.66\%$, where $L$ is the number of links and $N$ is the number of nodes (concepts). This number indicates that concepts are densely connected within the discipline considered: authors who conduct research in physics extensively use common terminology. One consequence is the high value of the mean node degree $\langle k \rangle$. The standard deviation of the node degree distribution indicates a high level of inhomogeneity in the concept co-occurrence statistics. This can also be seen from the skewed shape of the histogram of node degree values $P(k)$, shown in figure 1a by grey discs. The tail of the histogram may be visually compared with a power-law function $k^{-\gamma}$ with an exponent close to $\gamma = 1$. While this empirical network cannot be formally classified as a so-called dense network [24-27], it is much denser than other real networks [20]. Similar shapes of node degree distributions were found and declared to be robust for a few other analogous empirical networks [17,18]. The negative value of the assortative mixing by degrees, $r = -0.28$, defined as the Pearson correlation coefficient between the node degrees at both ends of the links in the network, indicates that in the network of concepts the high-degree nodes attract low-degree ones to a large extent. The presence of connectivity patterns is reflected in the comparatively high values of the mean clustering coefficient $\langle c \rangle$ and the global transitivity $C$ (cf. $\langle c \rangle = C = 1$ for the complete graph and $\langle c \rangle = C = 0$ for a tree). For a node $i$ of degree $k_i > 1$, the clustering coefficient is the ratio of the number $m_i$ of existing links between its neighbouring nodes to the number of all possible connections between them, $c_i = 2 m_i [k_i(k_i-1)]^{-1}$. In turn, the global transitivity is defined as the ratio between the number of closed triplets in the network and the total number of network triplets [28]. The difference between the two values, $\langle c \rangle$ and $C$, indicates specific topological features of the network. With quantitative measures of the basic network features at hand, let us proceed with modelling a growth process that results in a network topology similar to the empirically observed one.
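The indicators discussed above (link density, degree assortativity, mean clustering coefficient and global transitivity) can be computed directly from an adjacency structure. The sketch below is illustrative only: the function name `network_metrics` and the four-node toy graph are assumptions, and assigning a zero clustering coefficient to nodes of degree below two is a convention chosen here, not one stated in the text.

```python
from itertools import combinations


def network_metrics(adjacency):
    """adjacency: dict node -> set of neighbours (undirected, no self-loops)."""
    nodes = list(adjacency)
    n_nodes = len(nodes)
    degree = {v: len(adjacency[v]) for v in nodes}
    n_links = sum(degree.values()) // 2
    density = 2 * n_links / (n_nodes * (n_nodes - 1))

    # Pearson correlation between the degrees at the two ends of every link;
    # each link is traversed in both directions, so both ends share one mean.
    ends = [(degree[u], degree[v]) for u in nodes for v in adjacency[u]]
    mean_end = sum(x for x, _ in ends) / len(ends)
    cov = sum((x - mean_end) * (y - mean_end) for x, y in ends)
    var = sum((x - mean_end) ** 2 for x, _ in ends)
    assortativity = cov / var if var > 0 else float("nan")

    closed = 0    # closed triplets, counted once per central node
    triplets = 0  # all triplets (pairs of neighbours of a central node)
    local = []
    for v in nodes:
        k = degree[v]
        if k < 2:
            local.append(0.0)  # assumed convention for nodes of degree 0 or 1
            continue
        linked_pairs = sum(1 for a, b in combinations(adjacency[v], 2)
                           if b in adjacency[a])
        possible = k * (k - 1) // 2
        local.append(linked_pairs / possible)
        closed += linked_pairs
        triplets += possible
    return {
        "density": density,
        "assortativity": assortativity,
        "mean_clustering": sum(local) / n_nodes,
        "transitivity": closed / triplets if triplets else 0.0,
    }


# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to node 0.
toy_graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
metrics = network_metrics(toy_graph)
```

Even in this small example the mean clustering coefficient (7/12) and the global transitivity (3/5) differ, and the degree correlation is negative because the hub is linked to lower-degree nodes.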
We start with the Erdős-Rényi random graph [29] and Barabási-Albert preferential attachment [30] models. Both models allow us to generate uncorrelated networks with the same number of nodes and links as the empirical one. Therefore, the density of links and the average node degree coincide too. The discrepancies become evident with a more in-depth analysis. The network characteristics averaged over an ensemble of 100 realizations of each model are shown in the 2nd and 3rd lines of table 1. The Erdős-Rényi random graph is much more homogeneous than the empirical network: the standard deviation of the node degree is almost 40 times smaller than that for the empirical concept network, and the maximal node degree $k_{\max}$ exceeds its average value $\langle k \rangle$ by only 12%. This may be observed in figure 1a, where the corresponding histogram $P(k)$ is shown by black discs. The Barabási-Albert model, which has growth and preferential attachment as key ingredients, better reproduces the node degree heterogeneity of the empirical network: $k_{\max}$ exceeds $\langle k \rangle$ by more than 300%, which is almost 20 times the corresponding value for the Erdős-Rényi graph. However, the decay of $P(k)$ is much faster than in the empirical network (see the red discs in figure 1a and the solid line that corresponds to $P(k) \sim k^{-\gamma}$ with the Barabási-Albert model decay exponent $\gamma = 3$ [30]). The discrepancies are even more pronounced when one considers connectivity patterns between nodes of different degrees. Similar to the Erdős-Rényi graph, the Barabási-Albert network is neither assortative nor disassortative, so the disassortativity of the empirical network of concepts is a feature that neither model captures. The other feature that is not captured by the models is the difference between the average clustering coefficient $\langle c \rangle$ and the global transitivity $C$, even though the values for the Barabási-Albert model are closer to those for the empirical network than the ones for the Erdős-Rényi network.
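The homogeneity of the Erdős-Rényi graph can be illustrated with a small numerical experiment. The sketch below samples a random graph with a fixed number of nodes and links and inspects its degree statistics. The sizes (500 nodes, 9,500 links, giving a density close to 7.6%) are toy values assumed for the example and are not the empirical ones, and the function name `erdos_renyi_degrees` is hypothetical.

```python
import random
from statistics import mean, pstdev


def erdos_renyi_degrees(n_nodes, n_links, seed=0):
    """Sample a random graph with fixed numbers of nodes and links, return degrees."""
    rng = random.Random(seed)
    links = set()
    while len(links) < n_links:
        u, v = rng.sample(range(n_nodes), 2)  # two distinct nodes
        links.add((min(u, v), max(u, v)))     # store each link once
    degree = [0] * n_nodes
    for u, v in links:
        degree[u] += 1
        degree[v] += 1
    return degree


degrees = erdos_renyi_degrees(n_nodes=500, n_links=9500, seed=1)
k_mean = mean(degrees)   # 2 * 9500 / 500 = 38
k_max = max(degrees)
k_std = pstdev(degrees)
print(f"<k> = {k_mean}, k_max = {k_max}, std = {k_std:.2f}")
```

In such a sample the maximal degree stays within a modest factor of the mean, in line with the narrow degree distribution discussed above. A heavy-tailed network with the same density would show a much larger ratio.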
To understand the possible mechanisms that lead to the concept network under consideration, let us develop a model that is capable of reproducing its empirically observed features. In doing so, our primary goal is not to reproduce the given set of metrics with high precision. Rather, we are interested in a qualitative description of the main tendencies in the network structure and their explanation by network generation mechanisms. The model of network evolution that we suggest is based on the simultaneous account of two factors: growth by blocks and preferential selection. Consider a process with discrete time $t = 1, \dots, \mathcal{N}$. At each time step, a new article that contains a block of $n$ concepts is generated. It joins the concept network as a complete graph of $n$ nodes. The article generation consists of two steps: (i) drawing the block size $n$ and (ii) selecting particular concepts to populate the block. Below, we choose the option in which $n$ is drawn from the actual distribution of the number of concepts per article in the empirical data set, while other options are discussed in [20]. Let us explain step (ii) in more detail. When a new article is generated at time $t > 1$, the already existing data set consists of a set of $t-1$ articles $\mathcal{A}_{t-1}$ and a set of $N_{t-1}$ different concepts $\mathcal{C}_{t-1}$. The new article may contain some of these $N_{t-1}$ concepts as well as novel concepts that are introduced for the first time. Within our model, we fix the probability of the $i$-th concept of article $t$ to be a novel one, $p^{\rm novel}_{t,i} = \nu$. Consequently, with probability $1-\nu$ a concept of the generated article is one of the already existing $N_{t-1}$ concepts. Moreover, let us assume that the already existing concepts have different chances to be selected to populate an article: the more popular a concept is (among the first $t-1$ articles), the more likely it is to be selected to populate the $t$-th one. We call such a process preferential selection. The probability $p^{\rm exist}_{t,i}(c)$ for the concept $c$ to be selected is proportional to the number of articles $\mathcal{N}_{t-1}(c)$ in which the concept $c$ has appeared:
$$ p^{\rm exist}_{t,i}(c) = (1-\nu)\, \frac{\mathcal{N}_{t-1}(c)}{\sum_{c' \in \mathcal{C}^{\setminus}_{t-1}} \mathcal{N}_{t-1}(c')}, $$
where $\mathcal{C}^{\setminus}_{t-1}$ is the subset of concepts $\mathcal{C}_{t-1}$ excluding the $i-1$ concepts already selected for article $t$, and the denominator sums the number of times each concept from the set $\mathcal{C}^{\setminus}_{t-1}$ has appeared in all articles.
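The generation procedure described above (growth by blocks with preferential selection) can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `grow_concept_network` and the parameter name `nu` (the probability that a concept is novel) are hypothetical, block sizes are passed in as a plain list instead of being drawn from the empirical distribution, and a novel concept is forced whenever no existing concept is available (for instance, in the first article).

```python
import random
from itertools import combinations


def grow_concept_network(block_sizes, nu, seed=0):
    """Grow a concept network article by article.

    block_sizes: number of concepts in each generated article (assumed input).
    nu: probability that a concept of a new article is novel.
    Returns (appearances, links): appearances[c] is the number of articles that
    contain concept c, links is the set of concept pairs (u, v) with u < v.
    """
    rng = random.Random(seed)
    appearances = []
    links = set()
    for n in block_sizes:
        n_known = len(appearances)  # concepts that existed before this article
        block, chosen, n_new = [], set(), 0
        for _ in range(n):
            # Existing concepts not yet selected for this article.
            candidates = [c for c in range(n_known) if c not in chosen]
            if not candidates or rng.random() < nu:
                block.append(n_known + n_new)  # introduce a novel concept
                n_new += 1
            else:
                # Preferential selection: weight = number of earlier articles
                # in which the concept has appeared.
                weights = [appearances[c] for c in candidates]
                c = rng.choices(candidates, weights=weights, k=1)[0]
                chosen.add(c)
                block.append(c)
        appearances.extend([0] * n_new)
        for c in block:
            appearances[c] += 1
        # The article joins the network as a clique of its concepts.
        for u, v in combinations(sorted(block), 2):
            links.add((u, v))
    return appearances, links


# Toy run: 200 articles of 3 concepts each, with an arbitrary value of nu.
appearances, links = grow_concept_network([3] * 200, nu=0.2, seed=1)
```

The candidate list is rebuilt for every selected concept, which is simple but slow. For tens of thousands of articles a cumulative-weight structure would be a more practical choice.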
By the evolution mechanism described above, the concept network grows by adding cliques to the existing graph. At each time $t$, once a new article of $n$ concepts is generated, it enters the concept network as a complete graph of $n$ nodes and $n(n-1)/2$ links. Thus, during its evolution, the following processes may be observed in a generated concept network: (i) addition of new nodes, (ii) appearance of links between new nodes and between new and already existing nodes, (iii) appearance of new links between previously unconnected existing nodes, which is important for the generation of dense networks. The main features of the network of concepts generated by the growth by blocks with preferential selection mechanism are given in the last line of table 1. As for the two previously described models, we display the values averaged over an ensemble of 100 network realizations. The number of articles generated in our simulations was set to be exactly the same as the number of articles ($\mathcal{N} = 36{,}386$) in the empirical data set. Fixing the number of articles does not ensure that the generated network will have the same number of nodes (concepts). The remaining free parameter of the model, the probability $\nu$ that a concept is novel, has been chosen as $\nu = 8.8 \cdot 10^{-3}$ to give a reasonable value of the number of concepts $N$, see [20], where other concept selection mechanisms were considered. As one can see from the table, the modelled network of concepts now possesses two features that the Erdős-Rényi and Barabási-Albert models failed to reproduce: it is disassortative ($r < 0$) and its mean clustering coefficient and global transitivity differ from each other. The fact that the growth by blocks and preferential selection mechanism correctly grasps the main features of the network of concepts is further supported by the form of the node degree histogram, shown by red squares in figure 1b. Now one observes characteristic decays in the regions of small and large values of $k$.
Black discs in the plot show the outcome of a modified model in which each block of concepts has a fixed size [20], which leads to an evident sharp lower bound.
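The three processes that accompany the addition of cliques (new nodes, links involving a new node, and new links between previously unconnected existing nodes) can be counted explicitly for any sequence of articles. The sketch below is illustrative: the function name `classify_growth`, the counter names and the toy sequence of blocks are assumptions made for the example.

```python
from itertools import combinations


def classify_growth(blocks):
    """Count growth events while articles (blocks of concepts) are added one by one."""
    nodes, links = set(), set()
    counts = {
        "new_nodes": 0,               # process (i)
        "links_with_new_node": 0,     # process (ii)
        "links_between_existing": 0,  # process (iii)
        "repeated_links": 0,          # pairs that were already linked
    }
    for block in blocks:
        block = set(block)
        new = block - nodes
        counts["new_nodes"] += len(new)
        for u, v in combinations(sorted(block), 2):
            if (u, v) in links:
                counts["repeated_links"] += 1
            elif u in new or v in new:
                counts["links_with_new_node"] += 1
            else:
                counts["links_between_existing"] += 1
            links.add((u, v))
        nodes |= block
    return counts


# Toy sequence of three articles with hypothetical concept labels.
toy_blocks = [["a", "b", "c"], ["c", "d"], ["a", "d"]]
counts = classify_growth(toy_blocks)
```

In the toy sequence the third article introduces no new concept but still adds a link between two existing ones, which is the kind of event that densifies the network.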
In the forthcoming publication [20] we will give a more detailed account of the suggested network evolution mechanism along with the analysis of its various modifications. Several conclusions are at hand to finalize this brief report. First of all, one should not go too far in trying to reach a one-to-one mapping between the features of the empirically observed network of concepts and the modelled one. Indeed, the model, which selects new concepts at random, completely neglects their content-related characteristics. Rather, the goal is to reveal which processes in the network evolution are relevant to its generic features. As we show in this report, these are the growth by blocks and preferential selection. Moreover, our analysis shows that the observed network structure emerges as a synergistic effect of both of these factors: neither of them alone leads to a satisfactory picture. The model suggested in this paper may also be of relevance in analysing the generating mechanisms for dense networks, which are the subject of ongoing interest [24-27].
This work was supported in part by the National Academy of Sciences of Ukraine, project KPKBK 6541030 (O.M. & Yu.H.) and by the National Research Foundation of Ukraine, project 2020.01/0338 (M.K.).