Text and Data Mining (TDM)
Text and data mining (TDM) are research techniques that use computational tools to identify and extract relevant information or patterns from large data sets or from text-based digital content.
As TDM gains popularity in research, it raises a number of challenges: researchers must consider legal, ethical, and logistical issues when selecting sources of text and/or data for analysis. This guide was developed to help LBNL researchers identify resources in our collections that may be available for TDM projects. It also includes sources that are freely available online. Many scholarly publishers, databases, and products offer APIs (application programming interfaces) that support more powerful data extraction.
Appropriate use of licensed resources:
Most of the library’s electronic resources are governed by license agreements that limit use to the LBNL community or to individuals who are physically present at LBNL facilities.
- Each user is responsible for ensuring that he or she uses these products solely for noncommercial, educational, scholarly, or research use. Systematic downloading, distribution of content to non-authorized users, or indefinite retention of substantial portions of information is strictly prohibited.
- The use of software such as scripts, agents, or robots is generally prohibited and may result in loss of access to these resources for the entire LBNL community.
- American Chemical Society Text Mining: Text mining of full-text ACS content, available by request. The content is delivered for local storage and analysis; users may use tools of their choice for analysis. The result format is text or PDF. How to register: LBNL users should email email@example.com. Users must agree to and sign an agreement with ACS. Limitations: No limitations on volume, but users will need to specify the content they would like to mine (journal title and date range, or a list of DOIs). Contact for technical questions: firstname.lastname@example.org
- arXiv API: Gives programmatic access to all of the arXiv data, search, and linking facilities. API calls are made as HTTP GET or POST requests to an appropriate URL from any web-enabled client (e.g. a web browser), so users can work in the programming language of their choice. The result format is Atom. Free to use; no registration or API key required. No stated limitations, but high-volume users should contact arXiv (http://arxiv.org/help/contact). Contact for technical questions: arXiv Google Group. For more information: http://arxiv.org/help/api/index
- BioMed Central API: The API retrieves: 1) BMC Latest Articles; 2) BMC Editors picks; 3) Data on article subscription and access; 4) Bibliographic search data. It is accessed through a RESTful interface; queries are made as HTTP GET requests. The result formats are JSON and PRISM Aggregator Message (PAM). It is free to access; no registration required. Contact for technical questions: email@example.com. For more information: https://www.biomedcentral.com/getpublished/indexing-archiving-and-access-to-data/api
- CrossRef REST API: The API allows access to metadata records for over 75 million scholarly works that have CrossRef DOIs, covering around 5000 publishers. It can be used for text and data mining, checking against funder mandates, and obtaining metadata in a variety of representations. It uses a RESTful interface, and the result format is JSON. No registration required. Contact for technical questions: firstname.lastname@example.org. For more information: https://www.crossref.org/services/metadata-delivery/rest-api/
- Elsevier (ScienceDirect, Scopus): Covers LBNL-subscribed Elsevier journals and books on the ScienceDirect full-text platform. Get a developer account to use the Elsevier APIs for non-commercial purposes, and query the APIs from LBNL IP addresses to ensure full access. You can also use the APIs to access citation data and abstracts from scholarly journals indexed by Scopus. For more information, see Text Mining documentation.
- HathiTrust Data API: This can be used to retrieve content (page images, OCR, and in some cases whole volume packages) and metadata for HathiTrust Digital Library volumes. It uses a RESTful interface, and the result format is XML, JSON, or binary, depending on the resource queried. There are two methods of access: via a Web client, which requires authentication (users who are not members of a HathiTrust partner institution must sign up for a University of Michigan “Friend” Account), or programmatically, using an access key that can be obtained at http://babel.hathitrust.org/cgi/kgs/request. It is not meant for large-scale retrieval of data. Contact for technical questions: email@example.com, https://www.hathitrust.org/feedback. For more information: https://www.hathitrust.org/data_api
- IEEE Xplore API: Provides flexible query and retrieval of metadata records for more than 4 million documents comprising IEEE journals, conference proceedings, and technical standards. It is accessed via HTTP requests using structured URL queries, with result formats of JSON and XML. To get started, follow the steps at https://developer.ieee.org/getting_started. A maximum of 200 results may be retrieved in a single query, and a query term can contain at most 10 words. Contact for technical questions: firstname.lastname@example.org. For more information: https://developer.ieee.org/
- JSTOR Data for Research: Data for Research allows you to download word frequencies, citations, key terms, and n-grams for up to 25,000 JSTOR articles at a time, or to easily submit requests for larger sets of articles. See also:
- JSTORr, a package of simple functions in R to work with DFR output.
- JSTOR’s Text Analyzer, a reverse search engine that analyzes documents that you upload (your own, or other articles) to find related materials in JSTOR.
- Public domain and OA datasets include full OCR text from early journals and current academic press open access titles.
- Nature Blogs API: A blog tracking and indexing service that tracks Nature blogs and other third-party science blogs. It uses a RESTful interface; queries are made as HTTP GET requests. The default result format is JSON; some queries return Atom/RSS. It is free to register. Limits: 2 calls per second; 5,000 calls per day; RSS results are limited to a maximum of 100 items. Contact for technical questions: email@example.com. For more information: http://www.nature.com/developers/documentation/api-references/blogs-api/
- Nature OpenSearch API: A bibliographic search service for Nature content. It is accessed via a REST API with two interfaces: 1) an OpenSearch standard interface using keyword searches; 2) an SRU search interface using CQL structured queries. The result formats are RSS, JSON, Atom, SRU XML, and Turtle, depending on the interface used. Free to register. Results are served in pages of 25 records, and additional records can be retrieved by paging through the result set. The page size can be varied and is capped at 100 records. Contact for technical questions: firstname.lastname@example.org. For more information: http://www.nature.com/developers/documentation/api-references/opensearch-api/
- ORCID API: Queries and searches the ORCID (Open Researcher and Contributor ID) researcher identifier system and obtains researcher profile data through a RESTful interface, with result formats of HTML, XML, or JSON. There are two access options: 1) use the Public API, which only returns data marked as “public”; 2) become an ORCID member to receive API credentials. Data retrieved through the Public API is limited. Contact for technical questions: https://orcid.org/help/contact-us. For more information: https://orcid.org/organizations/integrators/API
- PLOS (Public Library of Science): A Python tool for downloading, updating, and maintaining a repository of all PLOS XML article files. Use this program to download all PLOS XML article files instead of web scraping. See also: PLOS APIs, which query content from the seven open-access peer-reviewed journals of the Public Library of Science using any of the twenty-three terms in the PLOS Search.
- PubMed and NLM: Data Guide: A guide to using the E-Utilities API to access citation data for medical journal literature in PubMed and other NCBI databases, including the National Library of Medicine Catalog, MeSH, Gene, and PMC (PubMed Central).
- ScienceDirect API: There are multiple APIs available for different use cases, including text mining of full-text content, search widgets, displaying journal or book level data, federated searching, and indexing. Access, result format, and limitations all vary depending on the use case. Free to register. Contact email@example.com to receive an API key. Contact for technical questions: firstname.lastname@example.org. For more information: https://dev.elsevier.com/sd_apis.html
- Scopus APIs: There are multiple APIs available for different use cases, including displaying publications on a website, showing cited-by counts on a website, federated searching, populating repositories with metadata, populating VIVO profiles, and others. Access, result format, and limitations all vary depending on the use case. Registration is free, but some functionality requires LBNL affiliation. Contact for technical questions: email@example.com. For more information: https://dev.elsevier.com/; https://dev.elsevier.com/sc_apis.html
- Springer’s Digital Content: Per Springer’s Text and Data Mining Policy, individual researchers are encouraged to download subscription and open access content for TDM purposes directly from the SpringerLink platform. No registration or API key is required. Full-text content can be accessed easily and programmatically at friendly URLs based on the content’s Digital Object Identifier (DOI).
- Web of Science Web Services: This is a bibliographic search service that allows automatic, real-time querying of records, primarily for populating an institutional repository. Access is through the SOAP protocol. The result format is XML. Registration is free. Extractable data is limited to particular fields, databases, and file depths, and also depends on the host institution’s subscription. Contact for technical questions: https://support.clarivate.com/s/. For more information: http://wokinfo.com/products_tools/products/related/webservices/
- Wiley Text and Data Mining: This allows text- and data-mining access to content in the Wiley Online Library and is accessible through a RESTful interface via CrossRef’s TDM service. The result format is JSON. Users will encounter a click-through agreement and will receive a Client API Token, which is needed when requesting the full text of articles. Rate limits are implemented through CrossRef rate-limiting headers; exact limits are not specified. Contact for technical questions: TDM@wiley.com; firstname.lastname@example.org for support using the CrossRef TDM service. For more information: http://olabout.wiley.com/WileyCDA/Section/id-826542.html
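To illustrate the kind of programmatic access described for the arXiv API in the list above (an HTTP GET request returning Atom), here is a minimal Python sketch. It assumes the query endpoint http://export.arxiv.org/api/query and the search_query, start, and max_results parameters from the arXiv API documentation; please verify these before relying on them. The function names and the sample feed are made up for illustration and are not part of any arXiv library.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumed endpoint, taken from the arXiv API documentation; verify before use.
ARXIV_ENDPOINT = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def build_arxiv_url(search_query, start=0, max_results=10):
    """Build the URL for an HTTP GET request to the arXiv API."""
    params = urllib.parse.urlencode(
        {"search_query": search_query, "start": start, "max_results": max_results}
    )
    return f"{ARXIV_ENDPOINT}?{params}"


def parse_atom_entries(atom_xml):
    """Return (id, title) pairs for each entry in an Atom feed (str or bytes)."""
    root = ET.fromstring(atom_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        # Titles may wrap across lines in the feed; collapse the whitespace.
        entries.append((entry_id.strip(), " ".join(title.split())))
    return entries


def fetch_arxiv(search_query, start=0, max_results=10):
    """Fetch one page of results. This performs a network request."""
    url = build_arxiv_url(search_query, start, max_results)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_atom_entries(response.read())


# Hypothetical sample feed so the parser can be tried offline.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <title>An Example
      Title</title>
  </entry>
</feed>"""

if __name__ == "__main__":
    print(build_arxiv_url("all:electron", max_results=5))
    print(parse_atom_entries(SAMPLE_FEED))
```

Keeping max_results small and paging with start is a conservative default; as noted above, high-volume users should contact arXiv first.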
If a resource is not listed, contact the LBNL Library at email@example.com.
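For the CrossRef REST API listed above (RESTful, JSON results, no registration), a metadata search might look like the following Python sketch. It assumes the https://api.crossref.org/works endpoint with the query and rows parameters, and a response whose records sit under message.items, as described in the CrossRef REST API documentation; confirm these against the current documentation. The helper names and sample records are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint, taken from the CrossRef REST API documentation.
CROSSREF_WORKS = "https://api.crossref.org/works"


def build_crossref_url(query, rows=5):
    """Build the URL for a free-text metadata search."""
    params = urllib.parse.urlencode({"query": query, "rows": rows})
    return f"{CROSSREF_WORKS}?{params}"


def extract_works(payload):
    """Return (DOI, first title) pairs from a parsed works-list response."""
    items = payload.get("message", {}).get("items", [])
    results = []
    for item in items:
        # "title" is a list in CrossRef records and may be absent.
        titles = item.get("title") or [""]
        results.append((item.get("DOI", ""), titles[0]))
    return results


def search_crossref(query, rows=5):
    """Run the search. This performs a network request."""
    with urllib.request.urlopen(build_crossref_url(query, rows), timeout=30) as resp:
        return extract_works(json.load(resp))


# Made-up response fragment so the extraction step can be tried offline.
SAMPLE_RESPONSE = json.loads("""
{"status": "ok", "message-type": "work-list",
 "message": {"items": [
   {"DOI": "10.0000/example.1", "title": ["A Made-Up Example Record"]},
   {"DOI": "10.0000/example.2"}
 ]}}
""")

if __name__ == "__main__":
    print(build_crossref_url("text mining", rows=2))
    print(extract_works(SAMPLE_RESPONSE))
```

This sketch retrieves metadata only; full-text access to licensed content remains subject to the license terms described at the top of this guide.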