Populating Scrapy Items with Data from a Pandas DataFrame
===========================================================
In this article, we’ll explore how to populate Scrapy items with data from a pandas DataFrame. We’ll provide a step-by-step guide using Scrapy’s start_requests method, request meta, and pandas’ DataFrame.to_dict() method.
Introduction
Scrapy is an open-source web scraping framework for Python that allows you to easily extract data from websites. One of its powerful features is the ability to populate items with data retrieved during the crawling process. In this article, we’ll show how to use a pandas DataFrame as a source of data for populating Scrapy items.
Background
Before diving into the solution, let’s cover some background information on Scrapy and pandas:
- Scrapy: A web scraping framework that allows you to easily extract data from websites. It provides an easy-to-use API for extracting data, processing it, and storing it in a database or other storage system.
- Pandas: A powerful library for data manipulation and analysis in Python. It provides data structures and functions for efficiently handling structured data, including tabular data such as spreadsheets and SQL tables.
Solution Overview
The solution involves the following steps:
- Create a pandas DataFrame containing the geographic information you want to use.
- Use Scrapy’s start_requests method to attach each city’s data to the outgoing request via the meta argument.
- In the callback for the start URL, retrieve the data dictionary from response.meta.
- Use this dictionary to populate the fields of the Scrapy item. (A minimal sketch of the meta hand-off follows this list.)
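Before walking through the steps in detail, here is a small, self-contained sketch of that hand-off mechanism on its own: any dictionary attached to a request’s meta is available again on the response inside the callback. The spider name, URL, and payload below are placeholders, not part of the final solution:

import scrapy

class MetaDemoSpider(scrapy.Spider):
    name = "meta_demo"

    def start_requests(self):
        # Any payload can ride along on the request via meta
        payload = {"City": "Roma", "Latitude": 41.89, "Longitude": 12.48}
        yield scrapy.Request("http://example.com/", meta={"info": payload})

    def parse(self, response):
        # The same dictionary comes back on the response
        info = response.meta["info"]
        self.logger.info("Got info for %s", info["City"])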
Solution
Step 1: Create a Pandas DataFrame
First, create a pandas DataFrame containing the geographic information you want to use:
import pandas as pd

# Create a sample dataframe with city, latitude, and longitude data
data = {
    "City": ["Roma", "Napoli"],
    "Latitude": [41.89, 40.85],
    "Longitude": [12.48, 14.27]
}
df = pd.DataFrame(data)

# Print the dataframe
print(df)
Output:
     City  Latitude  Longitude
0    Roma     41.89      12.48
1  Napoli     40.85      14.27
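Step 3 will turn this DataFrame into a per-city lookup dictionary. To preview what that conversion produces, here is a short sketch using the df defined above (the exact call is explained in Step 3):

d = df.set_index("City", drop=False).to_dict(orient="index")
print(d)
# {'Roma': {'City': 'Roma', 'Latitude': 41.89, 'Longitude': 12.48},
#  'Napoli': {'City': 'Napoli', 'Latitude': 40.85, 'Longitude': 14.27}}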
Step 2: Define the Scrapy Item
Next, define the Scrapy item that will be populated with data from the pandas DataFrame:
import scrapy
from itemloaders.processors import MapCompose

class TestItem(scrapy.Item):
    Price = scrapy.Field(output_processor=MapCompose(str.strip))
    City = scrapy.Field(serializer=str)
    Latitude = scrapy.Field(serializer=str)
    Longitude = scrapy.Field(serializer=str)
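To see what the output_processor does in isolation, here is a small sketch: MapCompose(str.strip) runs str.strip over every value collected for the field when load_item() is called. The price strings are made up for illustration:

from itemloaders.processors import MapCompose
from scrapy.loader import ItemLoader

loader = ItemLoader(item=TestItem())
loader.add_value('Price', ['  1.200.000 EUR ', ' 950.000 EUR '])
item = loader.load_item()
print(item['Price'])  # ['1.200.000 EUR', '950.000 EUR']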
Step 3: Implement the start_requests Method
In this step, we’ll implement the start_requests method so that each outgoing request carries the matching row of data from the pandas DataFrame:
import re

def start_requests(self):
    # Build a lookup keyed by city name:
    # {"Roma": {"City": "Roma", "Latitude": ..., "Longitude": ...}, ...}
    # orient="index" keys the dict by row; drop=False keeps "City" as a column
    d = df.set_index('City', drop=False).to_dict(orient='index')
    pattern = re.compile(r"http://www\.immobiliare\.it/(\w+)/")
    for url in self.start_urls:
        city = pattern.search(url).group(1)
        yield scrapy.Request(url, meta={"info": d[city]})
In this code:
- We build a dictionary d keyed by city name, so each city maps to its row of data (orient="index" keys the dictionary by row, and drop=False keeps City available as a value).
- We iterate over each URL in the start_urls list.
- For each URL, we extract the city name with a regular expression.
- We yield a scrapy.Request whose meta carries that city’s information dictionary.
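As a side note, response.meta works fine here, but Scrapy 1.7+ also offers cb_kwargs, which delivers the data directly as keyword arguments of the callback instead of through the response. A sketch of that alternative, with a hypothetical parse_city callback:

def start_requests(self):
    d = df.set_index('City', drop=False).to_dict(orient='index')
    pattern = re.compile(r"http://www\.immobiliare\.it/(\w+)/")
    for url in self.start_urls:
        city = pattern.search(url).group(1)
        # cb_kwargs entries arrive as keyword arguments in the callback
        yield scrapy.Request(url, callback=self.parse_city,
                             cb_kwargs={"info": d[city]})

def parse_city(self, response, info):
    # "info" arrives directly; no response.meta lookup needed
    ...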
Step 4: Populate the Fields of the Scrapy Item
In this final step, we’ll populate the fields of the Scrapy item using the data from the pandas DataFrame:
from scrapy.loader import ItemLoader

def parse_start_url(self, response):
    info = response.meta["info"]
    for selector in response.css('div.content'):
        l = ItemLoader(item=TestItem(), selector=selector)
        l.add_css('Price', '.price::text')
        l.add_value('City', info['City'])
        l.add_value('Longitude', info['Longitude'])
        l.add_value('Latitude', info['Latitude'])
        yield l.load_item()
In this code:
- We retrieve the city information dictionary from response.meta.
- For each matching element in the response, we create an ItemLoader instance.
- We use add_css to extract the price via the .price::text selector.
- We use the add_value method to set the City, Longitude, and Latitude fields from the dictionary.
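One detail worth noting: an ItemLoader collects each field’s values as a list by default, so the loaded item will look like {'City': ['Roma'], ...} rather than {'City': 'Roma', ...}. If you want scalar values, you can add TakeFirst as an output processor. A sketch:

from itemloaders.processors import MapCompose, TakeFirst

class TestItem(scrapy.Item):
    Price = scrapy.Field(output_processor=MapCompose(str.strip))
    # TakeFirst returns the first non-empty collected value instead of a list
    City = scrapy.Field(output_processor=TakeFirst(), serializer=str)
    Latitude = scrapy.Field(output_processor=TakeFirst(), serializer=str)
    Longitude = scrapy.Field(output_processor=TakeFirst(), serializer=str)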
Example Use Case
Here’s a complete example that puts all of the pieces together. Suppose we have a Scrapy project with a CrawlSpider that extracts listing data from a website, and we want to enrich each item with geographic information stored in a pandas DataFrame. The spider class, name, and start URLs below are illustrative:

import re

import pandas as pd
import scrapy
from itemloaders.processors import MapCompose
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider

# Geographic lookup data
data = {
    "City": ["Roma", "Napoli"],
    "Latitude": [41.89, 40.85],
    "Longitude": [12.48, 14.27]
}
df = pd.DataFrame(data)

class TestItem(scrapy.Item):
    Price = scrapy.Field(output_processor=MapCompose(str.strip))
    City = scrapy.Field(serializer=str)
    Latitude = scrapy.Field(serializer=str)
    Longitude = scrapy.Field(serializer=str)

class ImmobiliareSpider(CrawlSpider):
    name = "immobiliare"
    start_urls = [
        "http://www.immobiliare.it/Roma/",
        "http://www.immobiliare.it/Napoli/",
    ]

    def start_requests(self):
        # Keyed by city name; the URL path segment must match the
        # DataFrame's City values (the matching is case-sensitive)
        d = df.set_index("City", drop=False).to_dict(orient="index")
        pattern = re.compile(r"http://www\.immobiliare\.it/(\w+)/")
        for url in self.start_urls:
            city = pattern.search(url).group(1)
            yield scrapy.Request(url, meta={"info": d[city]})

    def parse_start_url(self, response):
        info = response.meta["info"]
        for selector in response.css("div.content"):
            l = ItemLoader(item=TestItem(), selector=selector)
            l.add_css("Price", ".price::text")
            l.add_value("City", info["City"])
            l.add_value("Longitude", info["Longitude"])
            l.add_value("Latitude", info["Latitude"])
            yield l.load_item()
Run the Scrapy project and verify that the items are populated with geographic information from the pandas DataFrame.
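For example, assuming the spider above lives in a Scrapy project and keeps the name "immobiliare", you could run it and export the scraped items with:

scrapy crawl immobiliare -o items.json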
Last modified on 2024-07-29