GSoC 2018: A new performance-focused urlparse library for Scrapy

Continuing to develop the new library :)

For the last few weeks, I’ve been working on improving the library that I chose to work on, urlparse4. Since my fork has by now grown into a nearly completely different library, I have moved it into its own repository, hosted on my GitHub account. I have named it Scurl (because it is the urlparse library for Scrapy :D).

What makes this library different is that it is now nearly fully compatible with the urllib library from Python itself, which means it passes almost all of the test cases from urllib.parse :). In addition, I have added a function called canonicalize_url, which w3lib, a utility library of Scrapy, also provides. The function parses urls into their “normal form”. For example, if a url contains special characters such as whitespace, it might look like this: http://google.com/foo bar/foo. If you paste that into Google Chrome, the url in the address bar after the page loads will not match the one you pasted in. After being canonicalized, the url from the example above looks like this: http://google.com/foo%20bar/foo. The difference is that the whitespace was converted to the %xx format. That is just one example of canonicalizing urls.
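To see this in action, here is a quick sketch using the original canonicalize_url from w3lib, which is the function Scurl reimplements for speed; the behavior shown is w3lib's, and Scurl aims to match it:

```python
# Sketch using w3lib's canonicalize_url (pip install w3lib),
# the function Scurl reimplements for performance.
from w3lib.url import canonicalize_url

url = "http://google.com/foo bar/foo"
print(canonicalize_url(url))
# -> http://google.com/foo%20bar/foo  (the whitespace becomes %20)
```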

After a few weeks of work, Scurl now provides this functionality:

urlparse, urljoin, urlsplit, and canonicalize_url, which are the critical functions for improving the performance of Scrapy (see the sketch below). All of the work can be found in the pull requests of this project. Most of them are closed since they were already merged, but you can still find them there :) Let me know if you want to know something further than this :smile:
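To give a feel for the intended usage, here is a minimal sketch. Since Scurl aims to be a drop-in replacement for urllib.parse, the calls below simply mirror the standard library; note that the `scurl` module name in the import is assumed here for illustration:

```python
# Minimal usage sketch. Scurl aims to be a drop-in replacement for
# urllib.parse, so these calls mirror the standard library; the
# "scurl" module name below is an assumption for illustration.
from scurl import urlparse, urljoin, urlsplit, canonicalize_url

print(urlparse("http://example.com/path?q=1").netloc)   # example.com
print(urljoin("http://example.com/a/b", "../c"))        # http://example.com/c
print(urlsplit("http://example.com/path?q=1").query)    # q=1
print(canonicalize_url("http://example.com/foo bar"))   # http://example.com/foo%20bar
```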

The best way to contact me is probably through my email :)

nctl144@gmail.com