Analytics Vidhya

Analytics Vidhya is a community of Generative AI and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com


How to Schedule Python Scrapy Spiders on Heroku using Custom Clock Process for Free

Photo by Ramón Salinero on Unsplash

Have you ever tried checking for the price drop of an iPhone that you crave to buy every couple of hours during the Amazon Great Indian Sale?

Do you have high-priced products in your wishlist and wish to be notified of a price drop?

Do you wish to be notified when a stock/crypto falls below a specific price?

Then, you might want to build, deploy and periodically schedule a scraper to scrape data from the target website for free.

During the early stages of development, we can easily run and schedule Scrapy spiders on our local machines, but eventually we want to deploy and run our spiders in the cloud.

In this article, I will explain how to deploy your Scrapy spiders and periodically schedule them on Heroku using a custom clock process, for free.

Prerequisite

This tutorial assumes you already have a working Scrapy project that is ready to be deployed to Heroku.

Let’s Get Started

We need to install a couple of modules required for deploying and running Scrapy spiders:

  • Run pip install scrapyd to install the Scrapyd daemon.
  • Run pip install git+https://github.com/scrapy/scrapyd-client.git to install scrapyd-client.
  • Run pip install herokuify_scrapyd to install the herokuify_scrapyd Python module, which eases deploying Scrapy spiders to Heroku.

Heroku installs your project’s Python package dependencies via pip, so create a requirements.txt file in your project root directory by running pip freeze > requirements.txt.
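For reference, the generated requirements.txt would contain pinned entries for the modules installed above, roughly like the sketch below (version numbers are illustrative, not prescriptive; your pip freeze output will list the exact versions you installed):

```
herokuify-scrapyd==1.0
scrapy==2.4.1
scrapyd==1.2.1
scrapyd-client==1.2.0
```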

Getting Started with Heroku

  • Create an account on Heroku.
  • Install Heroku CLI which lets you create and manage Heroku apps directly from the terminal.
  • cd to your project folder and run heroku login -i to login into your Heroku account.
  • Run heroku create to create a Heroku app.

$ heroku create
Creating app… done, ⬢ <HEROKU_APP_NAME>
Created http://<HEROKU_APP_NAME>.herokuapp.com/ | git@heroku.com:<HEROKU_APP_NAME>.git
  • Run heroku git:remote -a <HEROKU_APP_NAME> to add a remote to your Heroku app.

After creating the Heroku app, you need to edit your scrapy.cfg file as below:

scrapy.cfg file used with Scrapy projects deployed on Heroku
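The original embed is not reproduced here, but based on the herokuify_scrapyd module’s documentation, a scrapy.cfg pointed at a Heroku deployment might look like the sketch below. The project name myproject and the app name are placeholders you must replace with your own values:

```
[settings]
default = myproject.settings

[scrapyd]
# Serve Scrapyd through herokuify_scrapyd's WSGI-friendly application
application = herokuify_scrapyd.app.application

[deploy]
# Port 80 is required because Heroku routes external traffic to your dyno
url = http://<HEROKU_APP_NAME>.herokuapp.com:80/
project = myproject
```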

Custom Clock Process

Heroku Scheduler is a free add-on that enables scheduling simple tasks every 10 minutes, every hour, or every day at a specified time. But what if you want to run your spider every 5 seconds, three times a day, or at a very specific time? In such scenarios, scheduling your spiders using a custom clock process provides greater control and is the recommended approach for spiders in production.

Bear in mind that the Scheduler add-on doesn’t guarantee execution of jobs at their scheduled time, and in very rare instances a job might be skipped or may run twice.

Now, let’s create a custom clock process to periodically schedule our Scrapy spider on Heroku. We will be using TwistedScheduler from the APScheduler Python library, since Scrapy is built on top of the Twisted networking framework. APScheduler is a lightweight, in-process task scheduler which provides a clean, easy-to-use scheduling API.

Let’s begin with installing modules required for scheduling by running pip install pytz and pip install apscheduler. Re-run pip freeze > requirements.txt to update your requirements.txt file.

Now, create a scheduler.py file as below in your project root directory.

APScheduler cron trigger that schedules our Scrapy spider on Heroku
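The embedded gist is not preserved here; a minimal sketch of such a scheduler.py is shown below. The Heroku app name, the project name ("myproject"), the spider name ("myspider"), and the cron hours are placeholder assumptions you must replace with your own values:

```python
# scheduler.py -- clock process that periodically schedules the spider
# by POSTing to Scrapyd's schedule.json API.
import urllib.parse
import urllib.request

SCRAPYD_URL = "http://<HEROKU_APP_NAME>.herokuapp.com/schedule.json"


def build_payload(project, spider):
    """URL-encode the form fields expected by Scrapyd's schedule.json API."""
    return urllib.parse.urlencode({"project": project, "spider": spider}).encode()


def send_request():
    """POST to schedule.json so Scrapyd queues a run of the spider."""
    request = urllib.request.Request(
        SCRAPYD_URL, data=build_payload("myproject", "myspider")
    )
    urllib.request.urlopen(request)


if __name__ == "__main__":
    # TwistedScheduler runs on the same Twisted reactor that Scrapy uses,
    # so the scheduler and any Twisted code cooperate in one process.
    from apscheduler.schedulers.twisted import TwistedScheduler
    from twisted.internet import reactor

    scheduler = TwistedScheduler(timezone="UTC")
    # Cron trigger: fire every day at 09:00 and 18:00 UTC.
    scheduler.add_job(send_request, "cron", hour="9,18")
    scheduler.start()
    reactor.run()
```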

The send_request() function makes a POST request to Scrapyd’s schedule.json API. Your Scrapy project name and spider name are sent as part of the POST request.

Refer to the APScheduler documentation for more details on the options provided by its cron trigger.

Create a Procfile in your project root directory that explicitly declares the commands to be executed to start the app. The Procfile for a Scrapy project would look like this:

Procfile used with Scrapy project
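The original embed is not preserved here; a plausible Procfile, assuming the clock script is named scheduler.py as above, would declare a web process running Scrapyd and a clock process running the scheduler:

```
web: scrapyd
clock: python scheduler.py
```

The clock entry is what heroku ps:scale clock=1 refers to later in this tutorial.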

Create a runtime.txt file in your project root directory specifying your Python runtime, as below:

python-3.7.10

Deploying Spider to Heroku

  • Run git init, git add ., and git commit -m "<COMMIT_MESSAGE>" to initialize a local Git repository and commit your application code to it.
  • Run git push heroku master to deploy your app to Heroku.

After deploying your spider to Heroku, run heroku ps:scale clock=1 to scale the clock process to a single dyno, thereby avoiding duplicate scheduled jobs.

And voilà! You have now successfully deployed your Scrapy spider to Heroku. You can check the logs by running heroku logs --tail, where you will be able to find the POST requests sent by scheduler.py.

Now, if you go to http://<HEROKU_APP_NAME>.herokuapp.com, you should see the Scrapyd welcome page where you can find your pending, running and finished jobs and you can check logs for the same.

To stop your scheduler entirely, run heroku ps:scale clock=0 to scale down your clock process to 0.

NOTE: Heroku’s Free and Hobby plans are suitable for non-commercial apps, and you might need to upgrade to other plans to schedule your spiders in production.

Congratulations! You have made it to the end of the tutorial and I hope this helps.

Source Code: You can refer to my Amazon Price Tracker Scrapy spider, scheduled periodically on Heroku using a custom clock process. It tracks the availability and price of an Amazon product that you wish to buy and notifies you through email on a price drop!



Written by Yashashree Suresh

A software developer who talks about software engineering 👩‍💻 and human psychology💡
