GSoC 2018: Developing a replacement library for urllib

A new library for parsing URLs

For the last two weeks, I have been working on the urlparse4 library, which my mentors and I chose as the basis for this project. It is built on the GURL class from the Chromium source and wrapped with Cython so that it can be distributed as a Python package. For now, the project repository can be found here.

My first task was to add Python 3 support, which the library lacked. This was not hard: I ran the package on Python 3, looked at the errors that came up, and fixed them one by one. The PR can be found here.

After the Python 3 support task, we realized that the library did not accept unicode input (unicode in Python 2, str in Python 3). The URL parsing code from the Chromium source expects bytes on the Python side, so I wrote a function that encodes text input before passing it on. The PR can be found here.
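To give an idea of what such an encoding step looks like, here is a minimal sketch for Python 3 only. The name to_bytes and the choice of UTF-8 as the default encoding are my assumptions for illustration; this is not the exact code from the PR.

```python
def to_bytes(url, encoding="utf-8"):
    """Return url as bytes (hypothetical helper, Python 3 only).

    Bytes input is passed through unchanged; text input is encoded.
    UTF-8 is an assumed default, not something the wrapper mandates.
    """
    if isinstance(url, bytes):
        return url
    if isinstance(url, str):
        return url.encode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(url).__name__)
```

With a helper like this, the wrapped parser only ever sees bytes, whether the caller passes "http://example.com/" or b"http://example.com/".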

Every project in the Scrapy organization uses Travis and Codecov to track the correctness and test coverage of the code (we don’t want to work on something without running the tests). I therefore set up both services for this project as well. Having the tests in place made things much easier, since I can immediately see what I did wrong in a new PR.

In addition, I included the urlparse tests from the urllib.parse package. When I ran them with tox, many of them failed, so I had to fix the corresponding correctness issues in urlparse4.
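The idea behind these tests is simple: a replacement function should return the same components as the standard library for the same input. Here is a small sketch of that kind of check. The CASES table and the check_urlsplit helper are made up for illustration and are not taken from the real test suite.

```python
from urllib.parse import urlsplit

# Hypothetical sample cases: (input URL, expected 5-tuple from urlsplit).
CASES = [
    ("http://www.example.com/a/b?x=1#frag",
     ("http", "www.example.com", "/a/b", "x=1", "frag")),
    ("//example.com/path",
     ("", "example.com", "/path", "", "")),
    ("mailto:user@example.com",
     ("mailto", "", "user@example.com", "", "")),
]


def check_urlsplit(split_func):
    """Run split_func over CASES and return a list of mismatches."""
    failures = []
    for url, expected in CASES:
        result = tuple(split_func(url))
        if result != expected:
            failures.append((url, result, expected))
    return failures


# The standard library passes its own expectations.
print(check_urlsplit(urlsplit))  # prints []
```

Passing the replacement's urlsplit to check_urlsplit instead would list every URL where it disagrees with urllib.parse.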

At the time, urlparse4 only supported urljoin and urlsplit, so I had to write more of the functions found in urllib.parse, such as urlparse and quote. The goal is to replace the functions used by canonicalize_url in w3lib, a library that is widely used in Scrapy. This task was a real challenge since I did not have much experience with Cython, so I had to read a lot of the GURL code in the Chromium source. After a while, I figured out that GURL already has a canonicalize function, which has the potential to replace the current canonicalization code in w3lib. I therefore tried to write a Cython wrapper for that function. It turned out quite well: everything worked, but the run time is still a problem. The link to the PR can be found here.
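To show why canonicalization depends on so many urllib.parse functions, here is a heavily simplified sketch that uses only the standard library. It is not the w3lib implementation and not the GURL one; simple_canonicalize is a hypothetical name, and the rules it applies (sort the query arguments, percent-encode the path, drop the fragment) are only a rough subset of what a real canonicalizer does.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def simple_canonicalize(url):
    """Very rough canonicalization sketch (illustration only)."""
    parts = urlsplit(url)
    # Sort query arguments by key, then by value, keeping blank values.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Percent-encode unsafe characters; "%" stays safe so that existing
    # escapes are not encoded twice. An empty path becomes "/".
    path = quote(parts.path, safe="/%") or "/"
    # Drop the fragment by passing an empty string as the last component.
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


print(simple_canonicalize("http://example.com/a b?z=1&a=2#top"))
# prints http://example.com/a%20b?a=2&z=1
```

Every call in this sketch goes through pure Python code in urllib.parse, which is the part a Cython-wrapped replacement is meant to speed up.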

In the coming week, I will work on the performance problem I currently have with the urlparse PR. I will update this blog post as I make progress :D
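For anyone who wants to measure this kind of thing themselves, here is a minimal timing sketch based on timeit. The helper time_per_call is a hypothetical name, and the example times the standard library urlsplit; the function under test would be passed in the same way. No numbers are given here because they depend on the machine.

```python
import timeit
from urllib.parse import urlsplit


def time_per_call(func, url, number=10000):
    """Return the average time in seconds of one func(url) call."""
    total = timeit.timeit(lambda: func(url), number=number)
    return total / number


baseline = time_per_call(urlsplit, "http://www.example.com/a/b?x=1#frag")
print("urlsplit: %.3f microseconds per call" % (baseline * 1e6))
```

One thing to keep in mind is that the standard library caches recent urlsplit results, so timing the same URL over and over can make it look faster than it would be on a stream of different URLs.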

The best way to contact me is probably through my email :)

nctl144@gmail.com