Elsevier Data Showcase


Wednesday, May 19, 2021, 12:00pm to 1:00pm



Recording for Harvard Affiliates

We are delighted to announce that Harvard University and Elsevier have signed a Master Data Use Agreement (DUA) that will facilitate access by Harvard researchers to a large multidisciplinary and multimodal collection of Elsevier-owned or Elsevier-managed datasets, including full-text article content, citation data, and curated data from domains such as pharmacology and engineering. Under the terms of the Agreement, Harvard researchers will have the opportunity to access these datasets for research purposes at no cost.

Experts from Elsevier will give overviews and “tours” of the datasets along with information about how to access the data, beginning with a General Overview of the datasets and what they can be used for, on Wednesday, May 19 at 12 noon ET. The session is open to Harvard-affiliated faculty, postdocs, students, and researchers.  Interested individuals can register online.

Speakers include:

Steve Watson will discuss ScienceDirect

Patrick Crisfulla will discuss PURE

Michael Magoulias will discuss SSRN

Daniel J Calton will discuss Scopus




Presented by Steve Watson, Director of Product Management

ScienceDirect offers 2,500+ journals, including Cell and The Lancet, 39,000+ books and 220+ major reference works, covering the largest available breadth of STEM areas such as Physical Sciences and Engineering, Life Sciences, Health and Social Sciences.

All academic subscribers of ScienceDirect can retrieve subscribed and open access journal and book content through the ScienceDirect APIs, in structured, full-text format for large-scale data mining projects. As part of the Harvard Data Science Collaboration you may propose projects that make use of non-subscribed content too. This would give you even more possibilities: to connect the dots between different datasets; to develop knowledge graphs using original research published in leading peer-reviewed journals and discover new insights and relationships; and to leverage foundational content published in books, review articles and reference works, as an accurate, comprehensive training data set, to create machine learning models such as neural networks to predict experimental outcomes.
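As a minimal sketch of what a single full-text retrieval request might look like, the snippet below only builds the URL and headers and does not send anything. The endpoint path, the `X-ELS-APIKey` header name, and the `text/xml` format are assumptions based on Elsevier's public developer documentation and should be confirmed there before use; the DOI and API key are placeholders, and `build_article_request` is a hypothetical helper name.

```python
from typing import Dict, Tuple
from urllib.parse import quote

# Assumed endpoint and header name; confirm both in the Elsevier developer docs.
ARTICLE_BASE_URL = "https://api.elsevier.com/content/article/doi/"
API_KEY_HEADER = "X-ELS-APIKey"


def build_article_request(doi: str, api_key: str) -> Tuple[str, Dict[str, str]]:
    """Return the URL and headers for one full-text retrieval request (not sent)."""
    # Keep the slash inside the DOI, percent-encode anything else that is unsafe.
    url = ARTICLE_BASE_URL + quote(doi, safe="/")
    headers = {API_KEY_HEADER: api_key, "Accept": "text/xml"}
    return url, headers


if __name__ == "__main__":
    # Placeholder DOI and key, for illustration only.
    url, headers = build_article_request("10.1016/j.example.2021.01.001", "YOUR-API-KEY")
    print(url)
    print(sorted(headers))
```

For a large-scale project, the same request would be repeated over a list of DOIs, within whatever rate limits and terms apply to your API key.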


A data mining project proposal should start with a focused research goal, e.g. improving energy efficiency in engineering processes or predicting chemical synthesis routes with reduced hazardous side products. If you have a hypothesis about which features to extract to answer that research goal, it should then be possible to find relevant articles and chapters to download for further processing and analysis.

We can provide a form for project proposals, but would basically like to know: the research goal and approach; whether any non-subscribed content is needed, and how much; who would have access to it; and what outputs will be created and shared.
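Before requesting the form, it may help to check a draft proposal against the points listed above. The sketch below is purely illustrative: the `ProjectProposal` class and its field names are hypothetical and do not reflect the actual form.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProjectProposal:
    """Hypothetical record of the points a proposal is asked to cover."""
    research_goal: str
    approach: str
    non_subscribed_items_needed: int = 0  # 0 means subscribed/open access content only
    people_with_access: List[str] = field(default_factory=list)
    planned_outputs: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of the points that are still unanswered."""
        missing = []
        if not self.research_goal.strip():
            missing.append("research_goal")
        if not self.approach.strip():
            missing.append("approach")
        if not self.people_with_access:
            missing.append("people_with_access")
        if not self.planned_outputs:
            missing.append("planned_outputs")
        return missing


if __name__ == "__main__":
    draft = ProjectProposal(
        research_goal="Predict synthesis routes with fewer hazardous side products",
        approach="Extract reaction conditions from full-text articles",
        non_subscribed_items_needed=5000,
    )
    print(draft.missing_fields())
```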

Here is a useful list of frequently asked questions about using the API and our TDM policy: https://www.elsevier.com/about/policies/text-and-data-mining/text-and-data-mining-faq


Presented by Michael Magoulias, Director of Operations and Product Fulfillment

SSRN is a platform for early-stage research (preprints) that has been in existence for over 25 years. It was founded by an economics professor, Mike Jensen, who early on saw the opportunities for digital communication of research ahead of the traditional publication process. Prior to the emergence of digital technologies, economists shared their research with colleagues via letter or at conferences, so the discipline had an established practice of disseminating its research findings prior to peer review.

The founding of SSRN allowed this dissemination to be conducted on a much larger and more rapid scale. Suddenly the entire research community could read the latest research as soon as it was posted on the site. Other social science disciplines quickly followed suit, and SSRN became a hub for research in Law, Finance, Accounting, Political Science, and Anthropology. Now it embraces over 60 research disciplines spanning the full spectrum of inquiry in higher education.

The service was self-funding, and so needed to adopt a business model that kept reading and submitting scholarship free of charge while sustaining its operations. SSRN therefore became a pioneer in the “freemium” business model. Charges only come into play at the institutional level. Harvard was an early adopter and is one of our most valued customers.

There are two key services that SSRN provides, and Harvard makes use of both. First, there is the Working Paper Series product, which allows organizations producing research to place it on SSRN’s Open Access site and distribute it via email alerts to a subscriber base of over 2.5 million registered users. Second, institutions are able to make use of SSRN as an early research database through a subscription alerting service that allows users to choose the disciplinary areas they would like to receive updates about. Harvard currently has several active Research Paper Series, and a number of subscriptions to the alerting services that are unrestricted in terms of the number of potential users.




Presented by Patrick Crisfulla, VP PURE

What is it?

Pure is a Research Information Management (RIM) system used by around 300 research-intensive institutions, primarily academic and government research institutions. It is typically set up by a single institution; however, there are cases where most research-intensive universities in a region participate in the same data systems (e.g. across Israel, Denmark, the Netherlands, Iceland, etc.).



For "researching research" - trends in research

An instance of Pure at a university provides a means to study the research of a particular institution in full depth, and researchers may be able to negotiate access to data that is not available in aggregated public-record databases.

For research on diversity, equity and inclusion, or other studies of research activity, large-scale public sources such as Scopus would typically be the starting point. However, partnering with universities to access university- or faculty-validated data from RIM systems might be a further consideration for some types of research about research, where more detailed HR data is needed to validate results.


Relational inter-linked data available per university, via structured API, with university permission.


Solution Description:



What's it used for?

Pure is used by institutions that need to solve these problems across a large research-intensive enterprise:


1. Conduct assessments of the institution’s research

2. Provide management reporting on the research activity of persons and teams for career talent development, tenure and advancement

3. Quickly identify internal talent on any given topic or concept across campuses to act on opportunities

4. Coordinate the enterprise workflow involved in “get project funding” activities between PIs and the research admin support team

5. Present a branded public research portal demonstrating research strengths, experience, and assets


What type of data does it contain?

A Pure RIM system dataset at some institutions contains a mix of HR data about research-producing persons

  • e.g. roles, positions, unique identifiers, sometimes birth/gender, as well as activities (teaching, positions held, etc.)

and also related research inputs and outputs per person, e.g. research grant applications, awarded grants, research projects, research outputs (preprints, publications, citations, patents) and (in some countries) research impact statements.


In addition, it is often a complete map of the research institution itself, listing:

  • research institutes, labs, departments, schools, as well as available assets, such as facilities, equipment, patents owned, etc.

Whereas aggregated public-record sources of data may have limitations (e.g. PubMed or similar sources contain only selected articles from a given set of participating journals, and can be only partially linked to institutions and persons), the datasets in university RIM systems have in some countries been validated and confirmed as complete across all of a university's schools, because they are used for faculty evaluation, national institution assessment (block grant funding to universities from government), etc.

  • This distinction between the record of an institution's publications in a public source vs. a complete view of publications in a RIM system is more pronounced if studying research trends in regions where local language publishing is a factor (if that is not completely captured in the global aggregated database).

Who owns the data?

The datasets in these RIM systems are not owned by Elsevier; they are owned by each individual research institution. Elsevier is the software solution provider and, in some cases, the database hosting provider.

By partnering with universities individually, or via governments that operate Pure across a research consortium of institutions, one might gain more detailed views.

Elsevier would be able to describe the datasets in detail, provide some support on regions and countries that capture more granular detail, and provide technical assistance on extracting data from the Pure API (after interested researchers have obtained university permission).


Scopus Database: Coverage, Metadata Attributes, Profiles, Data Formats + Large-Scale Real-World Use Cases


The brief presentation on Scopus, expected to be 15-20 minutes in length, will focus on the database’s global coverage, data model and metadata attributes, author and affiliation profiling, and available data formats. In addition, time permitting, it will provide a few real-world examples of large-scale implementations that use Scopus data to meet key information needs.


Specifically, the overview will cover the following:


  • Elsevier data assets overall
  • Scopus global journal and conference proceeding coverage and structure
  • Data model and metadata associated with 82.5M Scopus records
  • Curated author and affiliation profiling in Scopus
    • Advanced disambiguation and deduplication processes used
  • Available data formats
  • Real-world large-scale use cases:
    • Japanese Cabinet Office - Digital Transformation
    • Netherlands open science initiative
    • NIH Integrity Check grants fraud detection project
    • New Jersey Economic Development portal