
Embed Scrapy in WSGI Application

WSGI and Scrapy

A common question on the Scrapy tag at Stack Overflow is "How do I use Scrapy with Flask, Django, or any other Python web framework?" Most users are accustomed to Scrapy’s generated projects and CLI options, which make crawling a breeze, but are confused when trying to integrate Scrapy into a WSGI web framework. A common traceback encountered is ReactorNotRestartable, which stems from the underlying Twisted framework. This occurs because, unlike asyncio or Tornado, Twisted’s event loop/reactor cannot be restarted once stopped (the full reason is out of scope; a minimal reproduction follows the list below). So it becomes apparent that the trick to integrating Scrapy and WSGI frameworks is taming Twisted. Luckily, integrating async Twisted code with synchronous code has become quite easy and is only getting easier. In this post, the following will be demonstrated:

  • Embed a crawler in a WSGI app and run it using Twisted’s twist web WSGI server.
  • Embed a crawler in a WSGI app and run it on any WSGI server (examples: gunicorn, uwsgi, or hendrix).
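
To see Twisted’s constraint concretely, here is a minimal sketch (not from the repo) that reproduces the error:

from twisted.internet import reactor

reactor.callLater(0, reactor.stop)
reactor.run()   # starts the reactor, which then stops itself
reactor.run()   # raises twisted.internet.error.ReactorNotRestartable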

Requirements

  • Python 2.7+
  • Twisted 17+
  • Scrapy 1.4+
  • Crochet 1.9.0+
  • Any WSGI compatible web framework (Flask, Django, Bottle, etc)

Optional Requirements - The following packages are used in the examples below, but any WSGI-compatible framework and WSGI server are sufficient.

  • Flask
  • Gunicorn

Git Repo

To make life easy, a git repository has been created to provide all the code that will be discussed.

git clone https://github.com/notoriousno/scrapy-flask.git
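
The repo is flat (there is no generated Scrapy project); its layout includes, at minimum, the three files discussed in this post:

scrapy-flask/
    quote_scraper.py    # the quote spider
    flask_twisted.py    # Flask app served by Twisted's WSGI server
    flask_crochet.py    # Flask app runnable under any WSGI server (via crochet)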

Quote Spider

Let’s set up a quick project structure. It will look a bit different to those accustomed to a traditional Scrapy project, but not by much. First, let’s create a file (quote_scraper.py) that will hold a spider that scrapes http://quotes.toscrape.com.

import re
import scrapy

class QuoteSpider(scrapy.Spider):

    name = 'quote'
    start_urls = ['http://quotes.toscrape.com']
    quotation_mark_pattern = re.compile(r'“|”')

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            # extract quote
            quote_text = quote.xpath('.//span[@class="text"]/text()').extract_first()
            quote_text = self.quotation_mark_pattern.sub('', quote_text)

            # extract author
            author = quote.xpath('.//span//small[@class="author"]/text()').extract_first()

            # extract tags
            tags = []
            for tag in quote.xpath('.//div[@class="tags"]//a[@class="tag"]/text()'):
                tags.append(tag.extract())

            # append to list
            # NOTE: quotes_list is passed as a keyword arg in the Flask app
            self.quotes_list.append({
                'quote': quote_text,
                'author': author,
                'tags': tags})

        # if there's next page, scrape it next
        next_page = response.xpath('//nav//ul//li[@class="next"]//@href').extract_first()
        if next_page is not None:
            yield response.follow(next_page)

A quick summary of what this spider does: scrape quotes.toscrape.com, extract each quote, author, and tags into a dict that gets appended to self.quotes_list, then scrape the next page, if one is available. For those wondering where self.quotes_list came from, it’s a keyword arg that gets passed into the spider object (this will be discussed further when the WSGI app is created). Commonly, scraped results would be stored in a database, but for demonstration purposes, I’ll show you a clever way to use a plain list to store values. self.quotes_list will simply be a list containing the relevant data, which we will later JSON-ify and return to the end user.
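
This works because Scrapy’s base Spider class copies any keyword arguments passed to it onto the instance; a quick sketch of the mechanism:

from quote_scraper import QuoteSpider

# Spider.__init__ copies keyword arguments onto the instance, so the list
# passed here becomes available inside parse() as self.quotes_list
spider = QuoteSpider(quotes_list=[])
assert spider.quotes_list == []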

WSGI Web App

Let’s embed CrawlerRunner to run the QuoteSpider, created in the previous section, within a Flask application (you could use Django, Bottle, CherryPy, etc.; Flask is just very common). Let’s create two endpoints: /crawl to actually scrape, and /results to provide the results of the scrape.

import json
from flask import Flask
from scrapy.crawler import CrawlerRunner
from quote_scraper import QuoteSpider

app = Flask('Scrape With Flask')
crawl_runner = CrawlerRunner()      # requires the Twisted reactor to run
quotes_list = []                    # store quotes
scrape_in_progress = False
scrape_complete = False

@app.route('/crawl')
def crawl_for_quotes():
    """
    Scrape for quotes
    """
    global scrape_in_progress
    global scrape_complete

    if not scrape_in_progress:
        scrape_in_progress = True
        global quotes_list
        # start the crawler and execute a callback when complete
        eventual = crawl_runner.crawl(QuoteSpider, quotes_list=quotes_list)
        eventual.addCallback(finished_scrape)
        return 'SCRAPING'
    elif scrape_complete:
        return 'SCRAPE COMPLETE'
    return 'SCRAPE IN PROGRESS'

@app.route('/results')
def get_results():
    """
    Get the results only if a spider has results
    """
    global scrape_complete
    if scrape_complete:
        return json.dumps(quotes_list)
    return 'Scrape Still In Progress'

def finished_scrape(null):
    """
    A callback that is fired after the scrape has completed.
    Set a flag so /results can display the results.
    """
    global scrape_complete
    scrape_complete = True


if __name__=='__main__':
    from sys import stdout
    from twisted.logger import globalLogBeginner, textFileLogObserver
    from twisted.web import server, wsgi
    from twisted.internet import endpoints, reactor

    # start the logger
    globalLogBeginner.beginLoggingTo([textFileLogObserver(stdout)])

    # start the WSGI server
    root_resource = wsgi.WSGIResource(reactor, reactor.getThreadPool(), app)
    factory = server.Site(root_resource)
    http_server = endpoints.TCP4ServerEndpoint(reactor, 9000)
    http_server.listen(factory)

    # start event loop
    reactor.run()

If you run this script, a Twisted WSGI server will start and serve the app on http://localhost:9000. For lack of a better phrase, Flask is running within Twisted. Let’s step through the crawl_for_quotes function. If no scraping is taking place, then a crawler is run. As mentioned before, we’re using CrawlerRunner, which allows spiders to be executed within an existing Twisted application. CrawlerRunner.crawl() returns a Twisted Deferred, which just means that it will “eventually” have a result. A callback is appended to eventual which will set the scrape_complete flag once the scraping is done.
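
For readers new to Twisted, here is a minimal, self-contained sketch (not from the repo) showing how a Deferred fires its callback chain once a result is available:

from twisted.internet import defer

def on_result(result):
    print('got: %s' % result)

d = defer.Deferred()
d.addCallback(on_result)    # runs when the result arrives
d.callback('a result')      # supply the result; prints 'got: a result'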

Twisted’s WSGI Server

For the “pro” users and Twisted BDFLs out there, you can use twist to easily spin up a WSGI application with a single command:

PYTHONPATH=$(pwd) twist web --wsgi flask_twisted.app --port tcp:9000:interface=0.0.0.0

If this looks strange or doesn’t work for you, don’t stress it; just run flask_twisted.py from the git repo. I’ve provided the twist invocation for anyone who may want an alternative to running the script directly.

Use Any WSGI Server

Most will want to deploy using a WSGI server like Gunicorn, and for those people flask_twisted.py will not work: a WSGI server only imports the app object, so the reactor.run() call in the __main__ block never executes and Twisted never starts. Fortunately, there’s a great project called crochet that runs the reactor in a background thread, allowing Twisted code to be called from a non-async code base. Without dwelling too much on how crochet works, let’s create a new flask_crochet.py file:

import crochet
crochet.setup()     # initialize crochet

import json
from flask import Flask
from scrapy.crawler import CrawlerRunner
from quote_scraper import QuoteSpider

app = Flask('Scrape With Flask')
crawl_runner = CrawlerRunner()      # requires the Twisted reactor to run
quotes_list = []                    # store quotes
scrape_in_progress = False
scrape_complete = False

@app.route('/crawl')
def crawl_for_quotes():
    """
    Scrape for quotes
    """
    global scrape_in_progress
    global scrape_complete

    if not scrape_in_progress:
        scrape_in_progress = True
        global quotes_list
        # start the crawler and execute a callback when complete
        scrape_with_crochet(quotes_list)
        return 'SCRAPING'
    elif scrape_complete:
        return 'SCRAPE COMPLETE'
    return 'SCRAPE IN PROGRESS'

@app.route('/results')
def get_results():
    """
    Get the results only if a spider has results
    """
    global scrape_complete
    if scrape_complete:
        return json.dumps(quotes_list)
    return 'Scrape Still In Progress'

@crochet.run_in_reactor
def scrape_with_crochet(_list):
    eventual = crawl_runner.crawl(QuoteSpider, quotes_list=_list)
    eventual.addCallback(finished_scrape)

def finished_scrape(null):
    """
    A callback that is fired after the scrape has completed.
    Set a flag so /results can display the results.
    """
    global scrape_complete
    scrape_complete = True

if __name__=='__main__':
    app.run('0.0.0.0', 9000)

crochet needs to set up its environment in order to work, so one of the first things a developer must do is call crochet.setup(). Each function that needs to run in the reactor thread must be wrapped with @crochet.run_in_reactor. So the difference between flask_twisted.py and flask_crochet.py is that the crawl is now executed in a reactor running in a separate thread, managed by crochet. And although the example doesn’t demonstrate this, the crawler can indeed run multiple times, essentially sidestepping the ReactorNotRestartable dilemma. Without further ado (adieu?), here is how to run this script in Gunicorn:

gunicorn -b 0.0.0.0:9000 flask_crochet:app
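
As an aside, a function wrapped with @crochet.run_in_reactor returns an EventualResult that the calling (non-reactor) thread can wait on. A minimal sketch, with an illustrative delayed_value function that is not part of the repo:

import crochet
crochet.setup()

from twisted.internet import task, reactor

@crochet.run_in_reactor
def delayed_value(seconds, value):
    # runs in the reactor thread; returns a Deferred
    return task.deferLater(reactor, seconds, lambda: value)

eventual = delayed_value(1, 'hello')    # returns an EventualResult immediately
print(eventual.wait(timeout=5))         # block this thread until the result arrives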

Test Endpoints

Initiate scrape

curl 127.0.0.1:9000/crawl
# output: SCRAPING

Trying to scrape again before the first scrape has completed

curl 127.0.0.1:9000/crawl
# output: SCRAPE IN PROGRESS

Server letting you know the scrape is complete

curl 127.0.0.1:9000/crawl
# output: SCRAPE COMPLETE

Getting the results

curl 127.0.0.1:9000/results

Getting pretty results using Python’s json.tool

curl 127.0.0.1:9000/results | python -m json.tool | less
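
The same flow as a small Python client (a sketch only; it assumes the requests package is installed and the server from above is running):

import time
import requests

print(requests.get('http://127.0.0.1:9000/crawl').text)        # 'SCRAPING'
while requests.get('http://127.0.0.1:9000/crawl').text != 'SCRAPE COMPLETE':
    time.sleep(1)                                              # poll until done
print(requests.get('http://127.0.0.1:9000/results').json())    # list of quote dicts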

Final Words and Caution

There are many other ways to solve the problem of combining Scrapy/Twisted with WSGI apps. This is merely a solution that has worked for me in the past; it’s simple and easy to grasp. However, things can quickly get chaotic when threads get involved. Developers will have to worry about shared variables, critical sections, locks, spawning too many threads, debugging, and a plethora of other nuisances. Hence the examples are very basic. Multithreaded code is difficult, which is why a Mozilla engineer jokingly mandates a height restriction for writing it. I’m planning to demonstrate how to achieve similar results in a single thread using klein and tornado in the future, so stay tuned!

