GSoC 2018: My proposal for Google Summer of Code

If this is your first time applying to GSoC, this might be useful for you!

Below is my proposal for GSoC 2018. The organization was the Python Software Foundation.

Code Sample

Project Info

Scrapinghub: Scrapy performance improvement

The project aims to identify the bottleneck components of Scrapy that need improvement; fixing them can speed up Scrapy's performance.

Summary of changes

  • While running Scrapy on a few benchmarkers, I found that Scrapy is slower on Python 3 than on Python 2. The main component that needs improvement is urlparse; we can therefore replace the urlparse library with another library that performs better. More bottleneck components will be identified over the summer.
  • Response.css also performs more slowly than xpath when its translation is not cached. Caching the results of the css_to_xpath function will speed up Scrapy projects that use CSS queries.
  • Cache the selector in Item Loader to solve the speed regression with Scrapy version > 1.1.
  • HttpCacheMiddleware can be enhanced by storing the status of the responses.

Scrapy performance on Python 2 and Python 3:

After profiling Scrapy on Python 2 and Python 3 with scrapy-bench bookworm, a benchmarker project built during GSoC 2017, and with the built-in scrapy bench command, Scrapy turned out to run about 30% slower on Python 3 than on Python 2. The % CPU profiling results are shown in the following tables:

Using scrapy-bench bookworm:

Python 2:

The average speed of the spider is 49.1726731424 items/sec
Vmprof profiling: http://vmprof.com/#/fbc9a2a8-c53e-4861-bb3e-8d93f2aeb53f

Python 3:

The average speed of the spider is 42.95632679033473 items/sec
Vmprof profiling: http://vmprof.com/#/cf104d4e-5e67-4628-9b81-de2da0d32b7c

Function            Python 2    Python 3
Parse               46.5%       48.0%
_extract_request    16.5%       20.1%
extract_links       13.3%       16.9%

Using the scrapy bench command:

Python 2:

2018-03-05 13:41:27 [scrapy.extensions.logstats] INFO: Crawled 653 pages (at 4320 pages/min), scraped 0 items (at 0 items/min)
Vmprof profiling: http://vmprof.com/#/f47ce555-d5b9-4e9d-a800-4fffbfd15b91/

Python 3:

2018-03-05 13:40:37 [scrapy.extensions.logstats] INFO: Crawled 421 pages (at 2040 pages/min), scraped 0 items (at 0 items/min)
Vmprof profiling: http://vmprof.com/#/62d5bd6f-bcb7-48da-8c05-9ad64817d717/

Function            Python 2    Python 3
Parse               30.3%       33.9%
extract_links       18.8%       29.6%
canonicalize_url    15.6%       27.1%

From these results, we can see that the parse function takes a higher CPU percentage on Python 3 than on Python 2. In particular, the urlparse function that Scrapy uses from the urlparse library is one of the reasons for this difference, since it performs slowly. In addition, on Python 2 it does not follow modern URL parsing rules, as mentioned in this Github issue. Therefore, we need to find a replacement for the urlparse library.

We will begin by replacing the urlparse library, creating a PR for this task over the summer. Once that PR is ready, we will need to investigate further to find other important components that can be improved. I plan to build more benchmarkers for the scrapy-bench project; they will help profile the components that take a higher % CPU on Python 3 than on Python 2. Based on the results in the tables above, the new benchmarkers will focus on extract_links, parse, _extract_request, etc.

Urlparse can be enhanced

LinkExtractor is one of the classes affected by urlparse. There are several Python libraries that could replace urlparse; yurl, purl and furl are commonly suggested. I profiled these libraries and compared the results to the urlparse library that Scrapy currently uses. My results are summarized in the tables below:

1. urlparse:

Library                      Python 2         Python 3
urlparse                     0.113 seconds    0.179 seconds
purl                         0.549 seconds    0.512 seconds
furl                         6.299 seconds    4.550 seconds
yurl                         0.105 seconds    0.100 seconds
uriparser (Cython wrapper)   0.37 seconds     0.29 seconds
GURL (Cython wrapper)        1.16 seconds     1.2 seconds

2. urljoin:

Library                      Python 2         Python 3
urlparse                     0.32 seconds     0.46 seconds
purl                         Could not run    1.47 seconds
furl                         13.85 seconds    10.21 seconds
yurl                         0.32 seconds     0.30 seconds
uriparser (Cython wrapper)   0.18 seconds     0.17 seconds

Each test ran the corresponding operation over about 22,000 unique URLs with each of these libraries.
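
For reference, the stdlib baseline measurement looked roughly like the sketch below (the file name urls.txt and the exact URL set are placeholders, not the original benchmark data):

import time

try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

# load the list of URLs to parse; urls.txt is a placeholder file
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

start = time.time()
for url in urls:
    urlparse(url)
elapsed = time.time() - start
print('stdlib urlparse: %.3f seconds for %d urls' % (elapsed, len(urls)))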

Based on the profiling results, yurl performs impressively on both Python 2 and Python 3. In addition, it provides equivalents of all the functions in the urlparse library, and it is written in pure Python. However, yurl is not maintained regularly, and it lacks a license.

Libraries that are not written in pure Python are harder to install. For example, uriparser wrapped with Cython needs autoconf, automake and libtool as prerequisites. We would also need to set up liburi, a C library based on uriparser. In addition, building a Cython wrapper for the uriparser library is another task we would need to take on for this issue.

Based on these results, I believe we should implement a replacement for the urlparse library in the Scrapy components that use it heavily. In this case, scrapy.linkextractors and functions from w3lib are the main locations for the replacement, with purl, yurl and uriparser (via a Cython wrapper) as the candidates. We will then profile the modified Scrapy, which uses these libraries instead of Python's urlparse, compare the results for each case and choose the best solution.

Response.css performance

Based on the profiling results from scrapy-bench cssbench, Scrapy projects that use response.css are much slower than those that use response.xpath, because Scrapy converts every CSS query to XPath before extracting data from a page.
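
To illustrate what this conversion involves, here is a small demo using parsel's HTMLTranslator (the CSS query is only an example):

# Rough illustration of the translation that happens behind response.css().
from parsel.csstranslator import HTMLTranslator

translator = HTMLTranslator()
# prints the equivalent XPath expression that lxml actually evaluates
print(translator.css_to_xpath('div.quote > span.text'))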

The solution to this problem is to cache the results of the css_to_xpath function. With the cache in place, the same CSS query is not translated multiple times, which saves a lot of time.

For this issue, we will use Python's functools library, which provides the lru_cache decorator on Python 3. lru_cache memoizes the decorated function, so repeated calls with the same input return the cached result instead of recomputing it. Because functools.lru_cache is only available on Python 3, we can use the functools32 backport on Python 2, installed through a conditional dependency in setup.py.
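
A minimal sketch of such a conditional dependency, assuming a simplified setup.py (the package name and requirement list are placeholders, not Scrapy's actual setup.py):

import sys
from setuptools import setup

install_requires = ['parsel', 'w3lib']           # placeholder requirement list
if sys.version_info[0] == 2:
    install_requires.append('functools32')       # lru_cache backport for Python 2

setup(
    name='example-project',                      # placeholder name
    version='0.1',
    install_requires=install_requires,
)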

Since the cssselect library affects many other projects besides Scrapy, we will add the cache for this function in the parsel library instead.

I have made a PR for this problem, which can be found here. However, it still needs further discussion and possibly more changes. We will also need to add tests to the PR later.

Adding a cache to the css_to_xpath function in csstranslator.py in the parsel project:

try:
    from functools import lru_cache        # Python 3
except ImportError:
    from functools32 import lru_cache      # Python 2 backport


class GenericTranslator(TranslatorMixin, OriginalGenericTranslator):
    @lru_cache(maxsize=256)
    def css_to_xpath(self, css, prefix='descendant-or-self::'):
        return super(GenericTranslator, self).css_to_xpath(css, prefix)


class HTMLTranslator(TranslatorMixin, OriginalHTMLTranslator):
    @lru_cache(maxsize=256)
    def css_to_xpath(self, css, prefix='descendant-or-self::'):
        return super(HTMLTranslator, self).css_to_xpath(css, prefix)
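
Assuming the patched translator above is in place, repeated queries benefit automatically; a rough illustration:

# Illustration only: with the lru_cache in place, the CSS query below is
# translated to XPath once and then served from the cache on every later call.
from parsel import Selector

sel = Selector(text='<div class="quote"><span class="text">hello</span></div>')
for _ in range(1000):
    sel.css('div.quote > span.text').extract()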

Speed regression for Scrapy > 1.1 on ItemLoader

From version 1.1 to the current version, Scrapy performs more slowly than earlier versions on projects that use Item Loader. The following test results show that Scrapy 1.0 performs better than Scrapy 1.5: the former crawls at 230,040 items/min, the latter at 136,800 items/min. The test is a spider that crawls a random url and whose parse function creates an Item Loader object 30,000 times.

On scrapy 1.0:

2018-03-07 17:22:24 [scrapy] INFO: Crawled 1 pages (at 0 pages/min), scraped 29230 items (at 230040 items/min)

On scrapy 1.5 (the latest version):

2018-03-07 17:10:03 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 0 pages/min), scraped 27985 items (at 136800 items/min)
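
For reference, the test spider looks roughly like the sketch below (the spider name, item definition and start URL are placeholders, not the exact benchmarker submitted in the PR):

import scrapy
from scrapy.loader import ItemLoader


class DummyItem(scrapy.Item):
    title = scrapy.Field()


class ItemLoaderBenchSpider(scrapy.Spider):
    name = 'itemloader_bench'
    start_urls = ['http://books.toscrape.com/']   # placeholder url

    def parse(self, response):
        # creating the loader many times exposes the cost of building
        # a new Selector on every initialization
        for _ in range(30000):
            loader = ItemLoader(item=DummyItem(), response=response)
            loader.add_css('title', 'title::text')
            yield loader.load_item()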

Since the current Item Loader class does not use the selector cache that the Response object offers, Item Loader creates a new Selector object every time it is initialized. Response.selector comes in handy here: it returns response._cached_selector if a cached selector exists, or builds a Selector for the current response if it does not. It can be applied to the old code, since the default_selector_class attribute is the Selector class by default.
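
Roughly, the selector property on TextResponse behaves like the following (paraphrased, not a verbatim copy of Scrapy's source):

@property
def selector(self):
    # build the Selector once and reuse it on later accesses
    from scrapy.selector import Selector
    if self._cached_selector is None:
        self._cached_selector = Selector(self)
    return self._cached_selector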

By using response.selector, we avoid creating a new Selector instance every time the Item Loader is initialized.

Old code:

if selector is None and response is not None:
    selector = self.default_selector_class(response)

Sample code:

if selector is None and response is not None:
    # reuse the selector cached on the response instead of building a new one
    self.default_selector_class = response.selector
    selector = self.default_selector_class

This solution needs to be discussed more since it modifies the Item Loader, which can affect many other components of Scrapy.

In addition, I have created a new benchmarker for this issue for testing. The PR can be found here.

HttpCacheMiddleware can be improved

Currently, HttpCacheMiddleware checks in process_request whether the cache already holds a response for the current request. If there is a cached response, the middleware returns it; otherwise it marks the request as a cache miss and returns nothing.

However, the same resource can still be downloaded multiple times in a project. Suppose the middleware processes a request while an identical earlier request is still being downloaded: because the earlier response has not been cached yet, the middleware lets the same resource be downloaded twice.

Therefore, one approach to this problem is to store the state of the responses, so that we know which one came first. At the same time, we check the value of the responses that are currently being processed; if duplicates are being processed at the same time, we return the earlier response and cache it. After adding the status property to the response objects, we will need to write a benchmarker that profiles HttpCacheMiddleware. The benchmarker can be added to the scrapy-bench project.
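
A very rough sketch of the idea (the class and attribute names below are illustrative, not existing Scrapy code; the real change would live inside HttpCacheMiddleware):

from scrapy.utils.request import request_fingerprint


class InFlightRequestTracker(object):
    """Illustrative helper only: remembers which requests are currently being
    downloaded so a duplicate can reuse the first response instead of hitting
    the network again."""

    def __init__(self):
        # request fingerprint -> 'in_progress' or the finished Response
        self._state = {}

    def process_request(self, request, spider):
        fp = request_fingerprint(request)
        state = self._state.get(fp)
        if state is not None and state != 'in_progress':
            return state                     # duplicate: reuse the finished response
        self._state.setdefault(fp, 'in_progress')

    def process_response(self, request, response, spider):
        self._state[request_fingerprint(request)] = response
        return response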

More information on this idea can be found in this Github issue.

Proposal Detailed Description/Timeline

Until April 23rd: Accepted student proposals announced
  • Get more comfortable with the Scrapy code base, especially the components that use the urlparse library
  • Submit more patches to Scrapy and help with issues in the repository. This will also help me get used to the workflow and become more involved in the community
  • Write more benchmarkers for scrapy-bench, since the project relies heavily on profiling.
  • Collect Scrapy components to build benchmarkers for.
  • Respond to the organization if there are any questions about my proposal.
April 23rd - May 14th: Community Bonding Period
  • Continue discussing the project. Profile the potential libraries that could replace urlparse and talk about the trade-offs.
  • Finalize my approach to the issue with the mentors, such as choosing the url parsing libraries to evaluate and defining more bottleneck components that affect Scrapy performance.
  • Create benchmarkers for the chosen components so that the summer projects can be tested more conveniently.
May 14th (Official Coding Start) - June 1: Pure Python solution suggested for Urlparse
  • Implement the chosen url parsing libraries, which are written in pure Python. Replace Python urlparse in the main components of Scrapy, such as in linkextractors and w3lib
  • Profile the result to see if the speed difference on Python 2 and Python 3 has been minimized
  • Compare the profiling result of chosen libraries.
June 2 - June 16 (first evaluation on 11-15): Cython wrapper solution suggested for Urlparse
  • Write a Cython wrapper for the chosen C libraries if needed.
  • Implement the replacement for Python urlparse with the libraries that are wrapped in Cython
  • Profile the result with the benchmarker in scrapy-bench and compare it with the results from the pure-Python approach
  • Check the code for bugs and make sure that the Cython wrapper runs smoothly on both Python 2 and Python 3, as the wrapper is likely to run into compatibility issues.
June 17 - July 5: Finish up urlparse issue, discuss more about the response.css and performance issues
  • Urlparse
    • Discuss with the mentor about the profiling result on proposed approaches. Choose the best one and get the PR ready to be merged
    • Write tests for the PR
  • Response.css
    • The suggested changes for response.css have been submitted as a PR to the parsel library, but more discussion with the parsel core developers is required
    • Update the response.css PR so that it meets all the requirements
    • Profile the results again to check if the solution is good
    • Write tests to make sure the code coverage does not change then get the response.css PR ready to be merged.
  • Collect performance issues and build benchmarkers
    • Collect potential issues related to Scrapy performance, based on issues reported by users and developers and tagged as “enhancement” or “performance”
    • Implement benchmarkers for the components that are the candidates for improvement
July 7 - July 15 (second evaluation on 9-13): Discuss the Scrapy 1.5 regression further, work on potential components that need improvement
  • Regression on Scrapy 1.5
    • A sample PR has been submitted for caching the selector each time the Item Loader is created
    • Discuss this PR’s approach further, since it might affect other components
    • Create test cases for the components in Scrapy that might be affected, then share the result with the developers
    • Continue to develop this PR until it meets all the requirements.
    • Write tests for this PR and get it ready for merging.
  • Potential components that need improvement
    • Profile the potential components on both Python 2 and Python 3, using the benchmarkers built in the previous weeks. Also compare against previous versions of Scrapy to make sure there is no performance regression.
    • Open issues and start discussing the problems with the developers
    • Outline a solution approach for the issues found and prepare a PR
July 16 - July 31: Work on the potential fixes for the performance issues
  • Work on potential fixes for the components that need improvement, the ones detected in the previous weeks
  • Write tests for the PRs and share the profiling results with the developers
  • These performance issues might include bug fixes for Scrapy as well.
August 1 - August 14: Discuss more about the HttpCacheMiddleware
  • Discuss the approach for this issue with the Scrapy developers; for now, we know we will need to add a property to the response so that we can check for duplicates.
  • Implement the solution in Scrapy code.
  • Profile the result with the benchmarker that we have made for HttpCacheMiddleware, then share the result with the developers.
  • Write the tests for this PR and get it ready to be merged
August 6 - August 14: ‘Pencils down’ date

This week is scheduled for writing missing tests and making sure that all the PRs are ready for evaluation.

August 14 - August 21: Final evaluation

Other commitments

I don’t have any other plans for this summer. If I am accepted to GSoC, I will spend all of my time working on it!

Scrapinghub - Scrapy is the only organization that I applied to.

The best way to contact me is probably through my email :)

nctl144@gmail.com