Understanding Web-Scraping from a Fragment (#) Menu Using R and JavaScript Libraries

Understanding Web-Scraping from a Fragment (#) Menu

Web-scraping is the process of extracting data from websites using specialized algorithms and software. In this article, we will explore how to scrape data from a menu whose entries are fragment (#) links. Specifically, we’ll discuss a common issue when working with such menus and provide a solution using R and several popular libraries.

Introduction

Web-scraping can be challenging due to the dynamic nature of websites. Some websites use JavaScript to load content dynamically, making it difficult for web-scrapers to retrieve data. In addition, some websites may use iframes or other techniques to obscure their content. Menus whose links are fragment identifiers (#) are a common example of this problem.

Understanding Fragments (#) Menu

When you visit a website, the browser first loads the HTML structure of the page. A fragment (#) menu is a set of links whose targets are fragment identifiers (for example, href="#studies"). Clicking such a link does not request a new page from the server; instead, the site’s JavaScript reacts to the fragment and loads or reveals the corresponding content within the current page.
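
As a quick illustration (using the httr package, which the rest of this article does not otherwise rely on), you can see that the # part is just a client-side component of the URL; it is never sent to the server:

library(httr)

# Parse an example URL; the path and fragment here are purely illustrative.
parse_url("https://www.neurosynth.org/analyses/724/#studies")$fragment
#> [1] "studies"

Because the server never sees the fragment, requesting the URL with or without it returns exactly the same HTML.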

Because this content is produced by client-side JavaScript, a scraper that only downloads the raw HTML never sees it. Retrieving it requires either simulating user interactions in a real (headless) browser or calling the same data source the JavaScript uses.

The Issue with Fragments (#) Menu

In the case of the provided Stack Overflow question, the author is trying to scrape data from a fragment (#) menu using R and the rvest library. However, the issue arises when using the html_elements function from rvest, which only sees the static HTML returned by the server; rvest does not execute JavaScript, so content injected after the page loads never appears in its results.
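
To make the problem concrete, here is a minimal sketch of the failing approach. The page URL and the CSS selector are assumptions about the site’s markup, not taken from the original question:

library(rvest)

page <- read_html("https://www.neurosynth.org/analyses/724/")  # hypothetical page URL
page %>% html_elements("table tbody tr")
# Expected result: an empty {xml_nodeset (0)}, because the table rows
# are injected by JavaScript after the initial HTML has been delivered.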

To access content behind a # fragment, you can either drive a headless browser that executes the page’s JavaScript, or identify the underlying request that the JavaScript makes and call it directly.
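
Here is a minimal headless-browser sketch using the chromote package. This is an assumption about one workable approach, not the method used in the original answer; the page URL and CSS selector are illustrative and would need to match the real site. (Recent versions of rvest also provide read_html_live(), which wraps chromote for this purpose.)

library(chromote)
library(rvest)

b <- ChromoteSession$new()

# Navigate and block until the page's load event fires
b$Page$navigate("https://www.neurosynth.org/analyses/724/", wait_ = FALSE)
b$Page$loadEventFired()
Sys.sleep(2)  # crude extra wait for the JavaScript to populate the menu content

# Pull the rendered DOM out of the browser and hand it to rvest
html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value
rendered <- read_html(html)
rendered %>% html_elements("table tbody tr") %>% html_text2()

b$close()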

Solution Using R

One approach to solving this issue is to call the API endpoint that returns all listings shown in the fragment menu. In the provided answer, we use the jsonlite library to request and parse the JSON returned by that endpoint.

Here’s a breakdown of how this works:

  • We create a link to the API endpoint: link <- "https://www.neurosynth.org/api/analyses/724/studies?dt=1"
  • We use jsonlite::read_json to read the JSON data returned by the endpoint.
  • Inside map_dfr, we use read_html from rvest to parse the small HTML snippet contained in each row of the response.
  • We then extract the title and link using html_node, html_text2, and html_attr.

Here’s some sample code that demonstrates this process:

library(jsonlite)
library(purrr)
library(rvest)
library(xml2)   # provides url_absolute()

# The API endpoint behind the fragment menu; it returns every listing as JSON.
link <- "https://www.neurosynth.org/api/analyses/724/studies?dt=1"
data <- jsonlite::read_json(link)$data

# Each element of `data` is one row of the table: an HTML snippet containing
# the study link, followed by the authors, journal, and loading columns.
df <- map_dfr(data, ~ {
  node <- read_html(.x[[1]]) %>% html_node("a")
  data.frame(
    title   = node %>% html_text2(),
    webpage = node %>% html_attr("href") %>% url_absolute(link),
    authors = .x[[2]],
    journal = .x[[3]],
    loading = .x[[4]]
  )
})

API Endpoints and Swagger Files

The provided answer also mentions that the swagger.json file still works but doesn’t list all routes. This suggests that there may be additional API endpoints available for accessing data behind fragment (#) menus.

To find these API endpoints, you can use tools like Postman or curl to test candidate URLs, or inspect the requests the page itself makes in your browser’s developer tools (Network tab).

Here’s an example of how to use curl to test the swagger.json file:

curl -X GET 'https://neurosynth.org/api/swagger.json'

This will return the JSON description of the API, including the routes it documents and their parameters.
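
You can also read that specification directly in R. The sketch below assumes the file follows the usual Swagger/OpenAPI layout, with a top-level paths object and (optionally) a basePath:

library(jsonlite)

spec <- jsonlite::read_json("https://neurosynth.org/api/swagger.json")
names(spec$paths)   # the routes the specification documents
spec$basePath       # base URL to prepend to each route, if the field is present

Keep in mind that, as noted above, the specification may not list every route the site actually uses.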

Additional Resources

If you’re new to web-scraping or want to learn more about working with fragment (#) menus, the documentation for the rvest, jsonlite, and purrr packages used above is a good place to start.


Last modified on 2025-04-10