The blog journal on science policy.

Methodology for analysing the number of authors per article using metadata from Scopus.

Methodology

We conducted an advanced search in Scopus, a bibliographic database covering more than 22,000 peer-reviewed journals (Scopus, 2017), between 27 February and 7 March 2017. The results were defined by a search query (see Scopus Algorithm below for an example for the subject area Agricultural and Biological Sciences, acronym AGRI) that specifies

  • a publication period from 2010 to 2016 (inclusive),
  • the document types (articles or reviews),
  • and a quantitative limit on the publication output (articles by the 20 authors with the most Scopus-listed articles in each subject area).

The selection ruled out overlaps between subject areas, so that no article is analysed under more than one category. Please note that the results are representative only of the selected group of authors. A document may be classified as an article and as a review at the same time.

The data were analysed for the 27 subject areas given by Scopus (2017b). Journals decide for themselves which subject areas they belong to in the database, so neither we nor Scopus had any influence on that assignment. A search conducted later than ours may return more articles, because articles continue to be added to Scopus after our data collection, including for earlier years. Further, the analyses were performed on the original data, with no changes except those given in the R code (see below).

Statistical analyses were performed with R (R Core Team, 2017) using a word count function (see R Code below) and a package for quantitative analysis (qdap; Rinker, 2013). XXXX is a placeholder for the four-character code of a Scopus subject area (Scopus, 2017b). Because authors can add suffixes to their last names, the word count function may give slightly inaccurate results, although we tried to exclude the cases we encountered. Scopus provides no guideline on this.
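To illustrate why suffixes matter for comma-based counting, here is a minimal sketch on made-up author strings. The string layout, the suffix pattern and all variable names are assumptions for illustration only; they are not taken from the Scopus export format or from the analysis itself.

```r
# Toy author strings in an assumed "Name Initials., Name Initials." layout
authors <- c(
  "Smith J., Doe A.B., Miller C.",   # three authors
  "King M.L., Jr., Brown T.",        # two authors, one with a suffix
  "Garcia P."                        # single author, no comma at all
)

# Drop generational suffixes so their commas are not counted as authors.
# This suffix list is illustrative and certainly incomplete.
suffix_pattern <- "\\b(Jr\\.|Sr\\.|II|III|IV),\\s*"
cleaned <- gsub(suffix_pattern, "", authors, perl = TRUE)

# Count commas; regmatches() returns an empty vector when there is no
# comma, so a single-author string yields 0 + 1 = 1.
comma_hits <- regmatches(cleaned, gregexpr(",", cleaned, fixed = TRUE))
author_count <- lengths(comma_hits) + 1

print(data.frame(cleaned, author_count))
```

Without the suffix step, the second string would be counted as three authors instead of two, which is the kind of inaccuracy described above.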

 

Scopus Algorithm

DOCTYPE ( re ) OR DOCTYPE ( ar )

AND SUBJAREA ( AGRI )

AND ( EXCLUDE ( SUBJAREA , "BIOC" ) OR EXCLUDE ( SUBJAREA , "PHYS" ) OR EXCLUDE ( SUBJAREA , "MEDI" ) OR EXCLUDE ( SUBJAREA , "ENVI" ) OR EXCLUDE ( SUBJAREA , "EART" ) OR EXCLUDE ( SUBJAREA , "IMMU" ) OR EXCLUDE ( SUBJAREA , "CHEM" ) OR EXCLUDE ( SUBJAREA , "VETE" ) OR EXCLUDE ( SUBJAREA , "PHAR" ) OR EXCLUDE ( SUBJAREA , "NEUR" ) OR EXCLUDE ( SUBJAREA , "SOCI" ) OR EXCLUDE ( SUBJAREA , "ENGI" ) OR EXCLUDE ( SUBJAREA , "MATE" ) OR EXCLUDE ( SUBJAREA , "MATH" ) OR EXCLUDE ( SUBJAREA , "ECON" ) OR EXCLUDE ( SUBJAREA , "COMP" ) OR EXCLUDE ( SUBJAREA , "ARTS" ) OR EXCLUDE ( SUBJAREA , "CENG" ) OR EXCLUDE ( SUBJAREA , "ENER" ) OR EXCLUDE ( SUBJAREA , "DECI" ) OR EXCLUDE ( SUBJAREA , "HEAL" ) OR EXCLUDE ( SUBJAREA , "NURS" ) OR EXCLUDE ( SUBJAREA , "PSYC" ) OR EXCLUDE ( SUBJAREA , "BUSI" ) OR EXCLUDE ( SUBJAREA , "DENT" ) OR EXCLUDE ( SUBJAREA , "MULT" ) )

AND ( LIMIT-TO ( PUBYEAR , 2016 ) OR LIMIT-TO ( PUBYEAR , 2015 ) OR LIMIT-TO ( PUBYEAR , 2014 ) OR LIMIT-TO ( PUBYEAR , 2013 ) OR LIMIT-TO ( PUBYEAR , 2012 ) OR LIMIT-TO ( PUBYEAR , 2011 ) OR LIMIT-TO ( PUBYEAR , 2010 ) )

AND ( LIMIT-TO ( AU-ID , "NAMEAUTHOR-ID ) OR LIMIT-TO ( AU-ID , "NAMEAUTHOR-ID ) OR LIMIT-TO ( AU-ID , "NAMEAUTHOR-ID ) [… and so on]
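Since the same query is repeated for each of the 27 subject areas, its fixed parts could be assembled programmatically. The following is only a sketch: build_scopus_query() and its arguments are hypothetical names, the list of codes is shortened to three, and the author clause is left out because its exact form is not given above.

```r
# Hypothetical helper: assemble the subject-area, exclusion and year
# clauses of a Scopus advanced-search string for one subject area.
build_scopus_query <- function(area, all_areas, years) {
  # Every subject area except the one under study is excluded
  other_areas <- setdiff(all_areas, area)
  exclude <- paste0('EXCLUDE ( SUBJAREA , "', other_areas, '" )',
                    collapse = " OR ")
  limit_years <- paste0("LIMIT-TO ( PUBYEAR , ", years, " )",
                        collapse = " OR ")
  paste(
    "DOCTYPE ( re ) OR DOCTYPE ( ar )",
    paste0("AND SUBJAREA ( ", area, " )"),
    paste0("AND ( ", exclude, " )"),
    paste0("AND ( ", limit_years, " )"),
    sep = "\n"
  )
}

# Only three of the 27 codes are listed here to keep the example short
areas <- c("AGRI", "BIOC", "PHYS")
query <- build_scopus_query("AGRI", areas, 2016:2010)
cat(query, "\n")
```

The author restriction (the AU-ID clauses) would still have to be appended per subject area, as it depends on the 20 authors selected for that area.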

R Code

library(ggplot2)
library(grid)
library(gridExtra)
library(qdap)
library(extrafont)

# Import data (XXXX)
XXXX <- read.csv("~/XXXX.csv", stringsAsFactors = FALSE)

# Delete name suffixes for comma separated counting
XXXX$Authors <- mgsub(c("Jr.,", "Sr.,", "I,", "II,", "III,", "IV,", "V,"), "", XXXX$Authors)

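# Count number of authors by comma separation; rows without any comma
# (single authors) yield zero matches and are therefore counted as 1
AuthCountXXXX <- lengths(regmatches(XXXX$Authors, gregexpr(",", XXXX$Authors, fixed = TRUE))) + 1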

# Create dataframe
AuthCountPerYearXXXX <- data.frame(AuthCountXXXX)
rm(AuthCountXXXX)

# Add year
AuthCountPerYearXXXX$YearXXXX <- as.numeric(XXXX$Year)

# Model for slope in AuthCount (columns referenced by their full names)
fitXXXX <- lm(AuthCountXXXX ~ YearXXXX, data = AuthCountPerYearXXXX)

# Plot graph: one boxplot per year plus a linear trend line
pPlot <- ggplot() + theme_bw() +
  theme(axis.text.x = element_text(colour = "darkgrey"),
        axis.text.y = element_text(colour = "darkgrey"),
        axis.ticks = element_line(colour = "lightgrey"),
        panel.background = element_rect(fill = NA),
        panel.grid.major.y = element_line(colour = "lightgrey"),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.border = element_rect(colour = "lightgrey")) +
  coord_cartesian(ylim = c(-1, 25)) + ylab("") + xlab("") +
  scale_x_continuous(breaks = c(2010, 2011, 2012, 2013, 2014, 2015, 2016)) +
  geom_boxplot(data = AuthCountPerYearXXXX,
               aes(x = YearXXXX, y = AuthCountXXXX, group = YearXXXX), na.rm = TRUE) +
  stat_smooth(data = AuthCountPerYearXXXX,
              aes(x = YearXXXX, y = AuthCountXXXX), method = "lm", col = "#e4a50d")

# Axis labels as text grobs, arranged around the plot
xaxis <- textGrob(label = "Year", hjust = 2, vjust = 26, gp = gpar(fontfamily = "Lato-Regular"))
yaxis <- textGrob(label = "Number of authors per article", hjust = -0.4, vjust = 2.5,
                  rot = 90, gp = gpar(fontfamily = "Lato-Regular"))
pXXXX <- grid.arrange(pPlot, right = xaxis, left = yaxis)
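The code above fits a linear model but does not show how the slope is read out. One possible way, sketched on synthetic data, is given below. The data frame toy, its column names and the simulated trend of 0.2 authors per year are assumptions made for this example only and say nothing about the actual Scopus results.

```r
# Synthetic stand-in for one subject area (not real Scopus data)
set.seed(1)
toy <- data.frame(Year = rep(2010:2016, each = 30))
toy$AuthCount <- 4 + 0.2 * (toy$Year - 2010) + rnorm(nrow(toy), sd = 0.5)

# Same kind of model as in the R code: author count regressed on year
fit <- lm(AuthCount ~ Year, data = toy)

# Slope = estimated change in authors per article per year
slope <- unname(coef(fit)["Year"])

# 95 % confidence interval and p-value for the slope
ci <- confint(fit, "Year", level = 0.95)
p_value <- summary(fit)$coefficients["Year", "Pr(>|t|)"]

cat(sprintf("Slope: %.3f authors per year (95%% CI %.3f to %.3f)\n",
            slope, ci[1], ci[2]))
```

Reporting the confidence interval alongside the slope makes it easier to judge whether an apparent increase in authors per article is distinguishable from no trend.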

Acknowledgements

We want to thank Claus Dalchow and Jana Rumler for reviewing earlier drafts of our approach and methodology.