GSoC 2018: Progress on replacing urlparse (Python libraries)

How do Python libraries do in terms of performance?

Baseline speed of the current version of w3lib, which uses the urlparse library (scrapy-bench bookworm):

Python 2:

The results of the benchmark are (all speeds in items/sec) :

Test = ‘Book Spider’ Iterations = ‘1’

Mean : 51.4802678754 Median : 51.4802678754 Std Dev : 0.0

http://vmprof.com/#/9901b9d9-774a-4e00-80a0-e29e8f5f8a8a

Python 3:

The results of the benchmark are (all speeds in items/sec) :

Test = ‘Book Spider’ Iterations = ‘1’

Mean : 49.70499576829961 Median : 49.70499576829961 Std Dev : 0.0

http://vmprof.com/#/36930420-fce5-4ca7-8615-63f4dff41f70

Scrapy-bench urlparseprofile:

Python 2:

Total number of items extracted = 32799

Time spent on file_uri_to_path = 0.222077131271

Time spent on safe_url_string = 0.517620325089

Time spent on canonicalize_url = 1.09397888184

Total time taken = 1.8336763382

Rate of link extraction : 17887.0170906 items/second

http://vmprof.com/#/d8c81d05-ff18-42ef-9360-722603b0eeae

Python 3:

Total number of items extracted = 32799

Time spent on file_uri_to_path = 0.25365063053322956

Time spent on safe_url_string = 0.9272843863873277

Time spent on canonicalize_url = 1.9254305128997657

Total time taken = 3.106365529820323

Rate of link extraction : 10558.641500859412 items/second

http://vmprof.com/#/22cfad11-5566-473e-a80a-a3db0043131d

Functions that are worth analyzing: parse_qsl, parse_qsl_to_bytes, urlencode, quote, unquote_to_bytes, unquote, urlunparse
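
To see where the time goes inside urllib itself, these functions can be micro-benchmarked directly. A minimal sketch (the sample data and iteration count are made up for illustration; the bytes variants are omitted since they are not available on both interpreters):

    from __future__ import print_function
    import timeit

    try:  # Python 3
        from urllib.parse import parse_qsl, urlencode, quote, unquote, urlunparse
    except ImportError:  # Python 2
        from urllib import urlencode, quote, unquote
        from urlparse import parse_qsl, urlunparse

    # 6-tuple in the shape urlunparse expects: scheme, netloc, path,
    # params, query, fragment.
    PARTS = ('http', 'example.com', '/path/page', '', 'a=1&b=two%20words', 'frag')

    benchmarks = [
        ('parse_qsl', lambda: parse_qsl('a=1&b=two%20words')),
        ('urlencode', lambda: urlencode({'a': 1, 'b': 'two words'})),
        ('quote', lambda: quote('/path with spaces/')),
        ('unquote', lambda: unquote('two%20words')),
        ('urlunparse', lambda: urlunparse(PARTS)),
    ]
    for name, func in benchmarks:
        print(name, timeit.timeit(func, number=100000))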

Yurl notes (modifying w3lib only). Link to the branch: https://github.com/nctl144/w3lib/tree/yurl

Some notes on w3lib urlparse replacement:

Since yurl supports neither parsing the query string, quoting (replacing special characters in URLs with %xx escapes), urlunparse, nor urlencode (which builds the query string from mapping objects), I only replaced the urlparse call, which backs the parse_url function. There is no urlsplit equivalent either, so this library cannot be applied to safe_url_string.
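
For illustration, here is a minimal sketch of what the swap looks like, assuming yurl's URL exposes scheme, host, path, query and fragment attributes (parse_url_with_yurl is a hypothetical stand-in, not the actual w3lib code):

    from yurl import URL

    def parse_url_with_yurl(url):
        # Split the URL with yurl instead of urlparse.urlparse; query
        # parsing, quoting and urlunparse still have to come from urllib.
        parts = URL(url)
        return parts.scheme, parts.host, parts.path, parts.query, parts.fragment

    print(parse_url_with_yurl('http://example.com/path?a=1#frag'))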

The profiling results after replacing urlparse with yurl (running scrapy-bench urlparseprofile) are noted below:

Python 2:

Total number of items extracted = 32799

Time spent on file_uri_to_path = 0.212097167969

Time spent on safe_url_string = 0.59363079071

Time spent on canonicalize_url = 1.05083322525

Total time taken = 1.85656118393

Rate of link extraction : 17666.5333111 items/second

http://vmprof.com/#/b4d83d62-449e-4984-922e-23f15b51c87d

Python 3:

Total number of items extracted = 32799

Time spent on file_uri_to_path = 0.14929384665447287

Time spent on safe_url_string = 0.892197898181621

Time spent on canonicalize_url = 1.7117554379074136

Total time taken = 2.7532471827435074

Rate of link extraction : 11912.842481261356 items/second

http://vmprof.com/#/f88e3c43-3014-4700-8a88-26a0f18ee485

According to vmprof, the canonicalize_url function is the main problem, at least in this benchmark. It is clear that canonicalize_url is much faster on Python 2 than on Python 3.

_safe_ParseResult and urlunparse are also notable functions that are faster on Python 2. In addition, quote is not called at all on Python 2, which is worth noting.
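
For reference, profiles like the ones linked above can be reproduced by wrapping the hot function with vmprof. A rough sketch, assuming w3lib and vmprof are installed (the URL and iteration count here are illustrative):

    import vmprof
    from w3lib.url import canonicalize_url

    with open('canonicalize.prof', 'w+b') as fd:
        vmprof.enable(fd.fileno())   # start sampling
        for _ in range(100000):
            canonicalize_url('http://example.com/do?b=2&a=1')
        vmprof.disable()             # stop and flush the profile

    # Inspect locally with the vmprofshow CLI: vmprofshow canonicalize.prof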

Scrapy-bench bookworm:

Python 2:

Mean : 58.0328048397 Median : 58.0328048397 Std Dev : 0.0

http://vmprof.com/#/5e225eba-f08b-4f85-a2d0-f3c50c5c6699

Python 3:

Mean : 52.71009297109393 Median : 52.71009297109393 Std Dev : 0.0

http://vmprof.com/#/3f2b093a-bb7d-4fa8-b118-357990d10736

The performance difference between the two environments is still about the same!

With the Scrapy code base modified as well:

The link to the branch can be found here: https://github.com/nctl144/scrapy/tree/yurl

Almost all the components that use urlparse have been replaced with yurl's URL, except for some of the functions in urls.py, because yurl does not support extracting the username and password separately (see the sketch below).
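
The missing piece, sketched below under the assumption that yurl only exposes the combined userinfo string ('user:password'), is that the credentials would have to be split by hand (split_userinfo is a hypothetical helper):

    from yurl import URL

    def split_userinfo(url):
        # yurl gives back 'user:password' in one piece (or '' when absent),
        # so username and password have to be separated manually.
        userinfo = URL(url).userinfo
        username, _, password = userinfo.partition(':')
        return username, password

    print(split_userinfo('http://user:secret@example.com/'))

Here are some profiling results, using scrapy-bench bookworm: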

Python 2:

The results of the benchmark are (all speeds in items/sec) :

Test = ‘Book Spider’ Iterations = ‘1’

Mean : 54.3864805312 Median : 54.3864805312 Std Dev : 0.0

http://vmprof.com/#/d1e118d2-bd46-4e8e-88e5-41635568f9b8

Python 3:

The results of the benchmark are (all speeds in items/sec) :

Test = ‘Book Spider’ Iterations = ‘1’

Mean : 49.69618510682849 Median : 49.69618510682849 Std Dev : 0.0

http://vmprof.com/#/fe208169-5aef-46ee-b86d-73d9763b6315

The parse function shows a significant difference between Python 2 and Python 3 (their parent functions have the same runtime, so we can compare the two parse functions directly).

On Python 2, parse takes 39.85% of the runtime, while on Python 3 it takes 43%. Since this is the main function that Scrapy runs, the difference builds up and creates the overall 10-15% speed gap.

Going deeper into the parts that use urlparse, namely extract_link, the picture is similar: on Python 2 it takes 15% (20/15.8 * 12.58), while on Python 3 it takes 17%.

Observing further: canonicalize_url is faster on Python 2, and so are urljoin and extract_link from the link extractor in the Scrapy code base, which eventually also leads into canonicalize_url.

-> Scrapy on Python 3 is still about 10% slower than on Python 2. The difference has been narrowed a bit, but we need more than that!

Purl notes: Link to the branch: https://github.com/nctl144/w3lib/tree/purl

This library lacks many of the functions we need, such as quoting and reassembling the parts of a URL.

Therefore, besides the urlparse function from urllib, we cannot replace any other function.
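
A rough sketch of the extent of the replacement, assuming purl's jQuery-style no-argument getter methods (parse_url_with_purl is a hypothetical stand-in, not the actual branch code):

    from purl import URL

    def parse_url_with_purl(url):
        # purl's accessors are getter methods, so a urlparse-shaped tuple
        # has to be rebuilt by hand; quoting and unparsing stay with urllib.
        u = URL(url)
        return u.scheme(), u.host(), u.path(), u.query(), u.fragment()

    print(parse_url_with_purl('http://example.com/path?a=1#frag'))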

I have replaced every urlparse call that can be replaced; the overall picture is unchanged:

Python 2:

Total number of items extracted = 32799

Time spent on file_uri_to_path = 0.218558073044

Time spent on safe_url_string = 0.495567321777

Time spent on canonicalize_url = 1.00913524628

Total time taken = 1.7232606411

Rate of link extraction : 19033.1045796 items/second

http://vmprof.com/#/0512074c-9fc2-48e6-a210-9115e955d8ff

Python 3:

Total number of items extracted = 32799

Time spent on file_uri_to_path = 0.7147161921602674

Time spent on safe_url_string = 0.9061613636440597

Time spent on canonicalize_url = 2.5368532372813206

Total time taken = 4.157730793085648

Rate of link extraction : 7888.678135329276 items/second

http://vmprof.com/#/862dfe3b-8dc3-409d-92cb-853e2b858dec

The result on Python 3 is terrible! As can be seen, the canonicalization time on Python 3 is about 2x what it takes on Python 2, with its child functions (parse_url, urlunparse, quote) taking significantly longer on Python 3.

In addition, this library is built on top of urlparse, so the result makes sense.

Furl notes: This library is also built on top of the urllib library, so we will move on to the C libraries.

Yarl notes: This library offers many functions that resemble those of urllib's parse module.

The link to the w3lib implementation can be found here: https://github.com/nctl144/w3lib/tree/yarl

I have replaced the components of w3lib that use the urlparse function: file_uri_to_path and canonicalize_url. However, the result is not that appealing, since yarl is much slower than urlparse for some reason (note that yarl is only available on Python 3):

Python 3:

The results of the benchmark are (all speeds in items/sec) :

Test = ‘Urlparse benchmarker’ Iterations = ‘1’

Mean : 3314.642303118921 Median : 3314.642303118921 Std Dev : 0.0

http://vmprof.com/#/bcb2607b-53cb-4e69-bcaa-55a6b023b7b4

In addition, I made another test that measures only the performance of yarl's URL against urlparse from the urllib library:

https://github.com/nctl144/urltest/blob/master/urlparsetest.py
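
The real script is at the link above; a minimal sketch of that kind of comparison looks like this (the URL and iteration count here are illustrative, not the actual test data):

    import timeit
    from urllib.parse import urlparse  # yarl is Python 3 only anyway
    from yarl import URL

    TEST_URL = 'http://example.com/some/path?a=1&b=2#frag'

    yarl_time = timeit.timeit(lambda: URL(TEST_URL), number=100000)
    # Note: urllib caches recently seen URLs, which flatters it on
    # repeated input like this.
    stdlib_time = timeit.timeit(lambda: urlparse(TEST_URL), number=100000)

    print('yarl:     %.3f seconds' % yarl_time)
    print('urlparse: %.3f seconds' % stdlib_time)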

As a result, urlparse is much faster than yarl:

Yarl:

the total time is 3.042547005432425 seconds

Urlparse:

the total time is 0.15865186118753627 seconds

I won’t implement it in Scrapy, since this sample result is already slow!

Note: we can’t really replace functions other than urlparse, since the other pure-Python libraries support neither query parsing nor URL quoting.

-> We can only replace the urlparse calls

As can be seen, the pure-Python URL parsing libraries are not good candidates for this project at all. I will continue running tests with C and Rust URL parsing libraries!

The best way to contact me is probably through my email :)

nctl144@gmail.com