GSoC 2018: Testing Rust-url, Chromium-gurl, uriparser

How do other libraries do in terms of performance and correctness?

Having seen that the results of the Python libraries are no different from those of the original urllib library, I decided to take a look at other libraries: rust-url, chromium-gurl and uriparser.

Speed test:

For the speed test, I used the Chromium URL test file, which contains about 83k unique URLs. Here are the results for each library's urlparse function:

Parsing speed (in seconds)

Library        Time
urllib         0.66s
rust           0.4s
uriparser      0.15s
gurl-cython    0.13s

As we can see from the results, all three libraries are faster than the original urllib library. If we use one of them as the replacement for urllib, I expect the performance to improve by 100%, since the canonicalize_url function in Scrapy, which takes a large share of the time while running a Scrapy spider, calls a lot of parsing functions.
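To make the measurement concrete, here is a minimal sketch of how such a timing run could look. The helper names time_parser and load_urls and the one-URL-per-line file layout are my assumptions for illustration, not the actual benchmark script; only urllib's urlparse is a real library call.

```python
import time
from urllib.parse import urlparse


def time_parser(parse, urls):
    """Return the wall-clock seconds spent calling parse() on every URL."""
    start = time.perf_counter()
    for url in urls:
        try:
            parse(url)
        except ValueError:
            # Some parsers reject malformed input; skip it so one bad URL
            # does not abort the whole run.
            pass
    return time.perf_counter() - start


def load_urls(path):
    """Read one URL per line, dropping blank lines (assumed file layout)."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


if __name__ == "__main__":
    # Stand-in data; the real run used the Chromium URL test file.
    sample = ["http://example.com/a?b=c#d", "https://[::1]:8080/x"] * 1000
    print("urllib: %.3fs" % time_parser(urlparse, sample))
```

Passing the parse function as an argument means the same loop can time any wrapper that exposes a urlparse-like callable, which keeps the comparison between libraries fair.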

Correctness test

The way each library parses URLs is really important, since we don't want Scrapy to parse URLs incorrectly. Therefore, I wrote a correctness test for each library to check which parsing standard it follows. The test file can be found here, and the repo for the correctness tests is here. The results are noted below:

Number of wrong parsing results (total test cases: 409):

Library        Wrong scheme   Wrong netloc   Wrong path
urllib         1              50             1
rust           0              0              0
uriparser      0              11             18
gurl-cython    0              34             56
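The counts above come from comparing each parsed component against an expected value. A minimal sketch of that comparison follows; count_mismatches, urllib_parse and the two CASES entries are hypothetical and are not taken from the real test file. The second case expects the host without the port, mirroring the expected values shown in the mismatch lists in this post.

```python
from urllib.parse import urlsplit


def count_mismatches(parse, cases):
    """Count wrong scheme/netloc/path results for one parser.

    parse(url) must return a (scheme, netloc, path) tuple;
    cases is a list of (url, scheme, netloc, path) expectations.
    """
    wrong = {"scheme": 0, "netloc": 0, "path": 0}
    for url, *expected in cases:
        actual = parse(url)
        # Dict keys iterate in insertion order: scheme, netloc, path.
        for field, want, got in zip(wrong, expected, actual):
            if want != got:
                wrong[field] += 1
    return wrong


def urllib_parse(url):
    """Adapter so urllib fits the (scheme, netloc, path) shape used here."""
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


# Hypothetical expectations, not taken from the real test file.
CASES = [
    ("http://example.com/foo", "http", "example.com", "/foo"),
    ("http://example.com:8080/", "http", "example.com", "/"),
]

print(count_mismatches(urllib_parse, CASES))
# -> {'scheme': 0, 'netloc': 1, 'path': 0}
```

urllib keeps the port inside netloc, so the second case counts as one netloc mismatch under these assumed expectations.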

Uriparser:

netloc

unmatched netloc: http://[2001::1]/, the result is: 2001::1, while it should be: [2001::1]
unmatched netloc: http://[::7f00:1]/, the result is: ::7f00:1, while it should be: [::7f00:1]
unmatched netloc: http://[::d01:4403]/, the result is: ::d01:4403, while it should be: [::d01:4403]
unmatched netloc: http://[2001::1]/, the result is: 2001::1, while it should be: [2001::1]
unmatched netloc: sc://%1F!"$&'()*+,-.;<=>^_{|}~/, the result is: %1F!, while it should be: %1F!"$&'()*+,-.;<=>^_{|}~
unmatched netloc: http://[1::]/, the result is: 1::, while it should be: [1::]
unmatched netloc: non-special://[1:2:0:0:5::]/, the result is: 1:2:0:0:5::, while it should be: [1:2:0:0:5::]
unmatched netloc: non-special://[1:2::3]/, the result is: 1:2::3, while it should be: [1:2::3]
unmatched netloc: non-special://[1:2::3]:80/, the result is: 1:2::3, while it should be: [1:2::3]
unmatched netloc: http://[0:1:0:1:0:1:0:1]/, the result is: 0:1:0:1:0:1:0:1, while it should be: [0:1:0:1:0:1:0:1]
unmatched netloc: http://[1:0:1:0:1:0:1:0]/, the result is: 1:0:1:0:1:0:1:0, while it should be: [1:0:1:0:1:0:1:0]
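Ten of the eleven mismatches above differ from the expected value only by the missing square brackets around an IPv6 literal. A wrapper could plausibly normalize this itself; the sketch below shows the idea, assuming the function receives a host with no port attached. bracket_ipv6 is a hypothetical helper, not part of uriparser.

```python
def bracket_ipv6(host):
    """Wrap a bare IPv6 literal in brackets; leave other hosts untouched.

    Assumes `host` carries no port, so any colon implies an IPv6 literal.
    """
    if ":" in host and not host.startswith("["):
        return "[" + host + "]"
    return host


for raw in ["2001::1", "::7f00:1", "[1::]", "example.com"]:
    print(raw, "->", bracket_ipv6(raw))
```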

Path:

('unmatched path at', 'a: foo.com', 'the result is', '', 'expected', ' foo.com')
('unmatched path at', 'lolscheme:x x#x%20x', 'the result is', '', 'expected', 'x x')
('unmatched path at', 'http://example.org/foo/bar#\', 'the result is', '', 'expected', '/foo/bar')
('unmatched path at', 'http://foo/path;a??e#f#g', 'the result is', '', 'expected', '/path;a')
('unmatched path at', 'http://example.org/foo/[61:24:74]:98', 'the result is', '', 'expected', '/foo/[61:24:74]:98')
('unmatched path at', 'http://example.org/foo/[61:27]/:foo', 'the result is', '', 'expected', '/foo/[61:27]/:foo')
('unmatched path at', 'http://example.com/foo/%2e%2', 'the result is', '', 'expected', '/foo/%2e%2')
('unmatched path at', 'http://example.com/foo%', 'the result is', '', 'expected', '/foo%')
('unmatched path at', 'http://example.com/foo%2', 'the result is', '', 'expected', '/foo%2')
('unmatched path at', 'http://example.com/foo%2zbar', 'the result is', '', 'expected', '/foo%2zbar')
('unmatched path at', 'http://example.com/foo%2%C3%82%C2%A9zbar', 'the result is', '', 'expected', '/foo%2%C3%82%C2%A9zbar')
('unmatched path at', 'http://%60%7B%7D:%60%7B%7D@h/%60%7B%7D?`{}', 'the result is', '', 'expected', '/%60%7B%7D')
('unmatched path at', 'sc:\../', 'the result is', '', 'expected', '\../')
('unmatched path at', 'wow:%NBD', 'the result is', '', 'expected', '%NBD')
('unmatched path at', 'wow:%1G', 'the result is', '', 'expected', '%1G')
('unmatched path at', 'file://host/dir/C|a', 'the result is', '', 'expected', '/dir/C|a')
('unmatched path at', 'http://example.org/test?%GH', 'the result is', '', 'expected', '/test')
('unmatched path at', 'http://example.org/test?a#%GH', 'the result is', '', 'expected', '/test')

Chromium GURL:

netloc

unmatched netloc at non-special://test@test/x the result is (empty) while it should be test
unmatched netloc at non-special://test/x the result is (empty) while it should be test
unmatched netloc at httpa://foo:80/ the result is (empty) while it should be foo
unmatched netloc at sc://fa%C3%9F.ExAmPlE/ the result is (empty) while it should be fa%C3%9F.ExAmPlE
unmatched netloc at sc://ho/i the result is (empty) while it should be ho
unmatched netloc at sc://ho/i the result is (empty) while it should be ho
unmatched netloc at sc://ho/i the result is (empty) while it should be ho
unmatched netloc at sc://ho/pa?i the result is (empty) while it should be ho
unmatched netloc at sc://ho/pa#i the result is (empty) while it should be ho
unmatched netloc at sc://%C3%B1.test/ the result is (empty) while it should be %C3%B1.test
unmatched netloc at sc://%1F!"$&'()*+,-.;<=>^_{|}~/ the result is (empty) while it should be %1F!"$&'()*+,-.;<=>^_{|}~
unmatched netloc at sc://%/ the result is (empty) while it should be %
unmatched netloc at sc://%C3%B1/x the result is (empty) while it should be %C3%B1
unmatched netloc at sc://%C3%B1 the result is (empty) while it should be %C3%B1
unmatched netloc at sc://%C3%B1?x the result is (empty) while it should be %C3%B1
unmatched netloc at sc://%C3%B1#x the result is (empty) while it should be %C3%B1
unmatched netloc at sc://%C3%B1#x the result is (empty) while it should be %C3%B1
unmatched netloc at sc://%C3%B1?x the result is (empty) while it should be %C3%B1
unmatched netloc at tftp://foobar.com/someconfig;mode=netascii the result is (empty) while it should be foobar.com
unmatched netloc at telnet://user:pass@foobar.com:23/ the result is (empty) while it should be foobar.com
unmatched netloc at ut2004://10.10.10.10:7777/Index.ut2 the result is (empty) while it should be 10.10.10.10
unmatched netloc at redis://foo:bar@somehost:6379/0?baz=bam&qux=baz the result is (empty) while it should be somehost
unmatched netloc at rsync://foo@host:911/sup the result is (empty) while it should be host
unmatched netloc at git://github.com/foo/bar.git the result is (empty) while it should be github.com
unmatched netloc at irc://myserver.com:6999/channel?passwd the result is (empty) while it should be myserver.com
unmatched netloc at dns://fw.example.org:9999/foo.bar.org?type=TXT the result is (empty) while it should be fw.example.org
unmatched netloc at ldap://localhost:389/ou=People,o=JNDITutorial the result is (empty) while it should be localhost
unmatched netloc at git+https://github.com/foo/bar the result is (empty) while it should be github.com
unmatched netloc at non-special://%E2%80%A0/ the result is (empty) while it should be %E2%80%A0
unmatched netloc at non-special://H%4fSt/path the result is (empty) while it should be H%4fSt
unmatched netloc at non-special://[1:2:0:0:5::]/ the result is (empty) while it should be [1:2:0:0:5::]
unmatched netloc at non-special://[1:2::3]/ the result is (empty) while it should be [1:2::3]
unmatched netloc at non-special://[1:2::3]:80/ the result is (empty) while it should be [1:2::3]
unmatched netloc at a://b/test-a-colon-slash-slash-b.html the result is (empty) while it should be b

path

unmatched path at non-special://test@test/x the result is //test@test/x while it should be /x
unmatched path at non-special://test/x the result is //test/x while it should be /x
unmatched path at foo:// the result is // while it should be (empty)
unmatched path at foo:///////// the result is ///////// while it should be ///////
unmatched path at foo://///////bar.com/ the result is /////////bar.com/ while it should be ///////bar.com/
unmatched path at foo:////:///// the result is ////:///// while it should be //://///
unmatched path at http://example.com/foo/%2e%2 the result is /foo/.%2 while it should be /foo/%2e%2
unmatched path at http://example.com/%2e.bar the result is /..bar while it should be /%2e.bar
unmatched path at http://example.com/foo%41%7a the result is /fooAz while it should be /foo%41%7a
unmatched path at http://example.com/foo%00%51 the result is /foo%00Q while it should be /foo%00%51
unmatched path at http://www/foo%2Ehtml the result is /foo.html while it should be /foo%2Ehtml
unmatched path at httpa://foo:80/ the result is //foo:80/ while it should be /
unmatched path at sc://fa%C3%9F.ExAmPlE/ the result is //fa%C3%9F.ExAmPlE/ while it should be /
unmatched path at mailto:x@x.com#x the result is x@x.com#x while it should be x@x.com
unmatched path at sc://ho/i the result is //ho/i while it should be /i
unmatched path at sc:///pa/i the result is ///pa/i while it should be /pa/i
unmatched path at sc://ho/i the result is //ho/i while it should be /i
unmatched path at sc:///i the result is ///i while it should be /i
unmatched path at sc://ho/i the result is //ho/i while it should be /i
unmatched path at sc:///i the result is ///i while it should be /i
unmatched path at sc://ho/pa?i the result is //ho/pa while it should be /pa
unmatched path at sc:///pa/pa?i the result is ///pa/pa while it should be /pa/pa
unmatched path at sc://ho/pa#i the result is //ho/pa while it should be /pa
unmatched path at sc:///pa/pa#i the result is ///pa/pa while it should be /pa/pa
unmatched path at sc://%C3%B1.test/ the result is //%C3%B1.test/ while it should be /
unmatched path at sc://%1F!"$&'()*+,-.;<=>^_{|}~/ the result is //%1F!"$&'()*+,-.;<=>^_{|}~/ while it should be /
unmatched path at sc://%/ the result is //%/ while it should be /
unmatched path at sc://%C3%B1/x the result is //%C3%B1/x while it should be /x
unmatched path at file://host/dir/C|a the result is /dir/C%7Ca while it should be /dir/C|a
unmatched path at sc://%C3%B1 the result is //%C3%B1 while it should be (empty)
unmatched path at sc://%C3%B1?x the result is //%C3%B1 while it should be (empty)
unmatched path at sc://%C3%B1#x the result is //%C3%B1 while it should be (empty)
unmatched path at sc://%C3%B1#x the result is //%C3%B1 while it should be (empty)
unmatched path at sc://%C3%B1?x the result is //%C3%B1 while it should be (empty)
unmatched path at sc://? the result is // while it should be (empty)
unmatched path at sc://# the result is // while it should be (empty)
unmatched path at sc:/// the result is /// while it should be /
unmatched path at sc://// the result is //// while it should be //
unmatched path at sc:////x/ the result is ////x/ while it should be //x/
unmatched path at tftp://foobar.com/someconfig;mode=netascii the result is //foobar.com/someconfig;mode=netascii while it should be /someconfig;mode=netascii
unmatched path at telnet://user:pass@foobar.com:23/ the result is //user:pass@foobar.com:23/ while it should be /
unmatched path at ut2004://10.10.10.10:7777/Index.ut2 the result is //10.10.10.10:7777/Index.ut2 while it should be /Index.ut2
unmatched path at redis://foo:bar@somehost:6379/0?baz=bam&qux=baz the result is //foo:bar@somehost:6379/0 while it should be /0
unmatched path at rsync://foo@host:911/sup the result is //foo@host:911/sup while it should be /sup
unmatched path at git://github.com/foo/bar.git the result is //github.com/foo/bar.git while it should be /foo/bar.git
unmatched path at irc://myserver.com:6999/channel?passwd the result is //myserver.com:6999/channel while it should be /channel
unmatched path at dns://fw.example.org:9999/foo.bar.org?type=TXT the result is //fw.example.org:9999/foo.bar.org while it should be /foo.bar.org
unmatched path at ldap://localhost:389/ou=People,o=JNDITutorial the result is //localhost:389/ou=People,o=JNDITutorial while it should be /ou=People,o=JNDITutorial
unmatched path at git+https://github.com/foo/bar the result is //github.com/foo/bar while it should be /foo/bar
unmatched path at non-special://%E2%80%A0/ the result is //%E2%80%A0/ while it should be /
unmatched path at non-special://H%4fSt/path the result is //H%4fSt/path while it should be /path
unmatched path at non-special://[1:2:0:0:5::]/ the result is //[1:2:0:0:5::]/ while it should be /
unmatched path at non-special://[1:2::3]/ the result is //[1:2::3]/ while it should be /
unmatched path at non-special://[1:2::3]:80/ the result is //[1:2::3]:80/ while it should be /
unmatched path at a:///test-a-colon-slash-slash.html the result is ///test-a-colon-slash-slash.html while it should be /test-a-colon-slash-slash.html
unmatched path at a://b/test-a-colon-slash-slash-b.html the result is //b/test-a-colon-slash-slash-b.html while it should be /test-a-colon-slash-slash-b.html
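Most of the path mismatches above share one pattern: for schemes such as sc:// or non-special://, the reported path still begins with the //authority part. The sketch below shows how such a string could be split after the fact. split_authority is a hypothetical helper, not a GURL API, and not necessarily how the wrapper will handle it; note that the returned authority still contains any userinfo and port.

```python
def split_authority(raw_path):
    """Split a '//authority/path' string into (authority, path).

    Strings that do not start with '//' are returned as a plain path.
    """
    if not raw_path.startswith("//"):
        return "", raw_path
    authority, slash, rest = raw_path[2:].partition("/")
    return authority, slash + rest


for raw in ["//test/x", "///pa/i", "//", "x@x.com"]:
    print(repr(raw), "->", split_authority(raw))
```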

In addition, the issue mentioned in #1304 is not a problem for these three libraries, as they handle relative URLs correctly.

As can be seen, rust-url passed all the test cases, but it is slower than both chromium-gurl and uriparser.

However, uriparser does not support parsing international URLs, so we will not use it for this project. Chromium-gurl does fail some of the test cases, but after discussing with my mentor, Konstantin, we decided to move forward with it, since it is faster than rust-url and the failed test cases are harmless. I will therefore be working on building a wrapper for chromium-gurl over the next few weeks!

The best way to contact me is probably through my email :)

nctl144@gmail.com