From vchimishuk at yandex.ru Thu Nov 1 22:03:43 2018
From: vchimishuk at yandex.ru (Viacheslav Chimishuk)
Date: Thu, 1 Nov 2018 23:03:43 +0200
Subject: [qutebrowser] Extend ad blocking to allow single URL blocking.
Message-ID: <20181101210343.5zkziu5jfzimh752@t420.localdomain>

Hi, guys.

I'm thinking about extending existing ad blocking module to allow
single URL blocking. For now we support two file formats:
host-per-line and UNIX hosts file format. My idea is to add third
format support: URL substring to block. For instance, if line is
foo.com/bar (file format needs to be considered) every URL containing
this substring will be blocked.
Looks simple and backward compatible.

Having this feature we can block ads on Youtube with a single
line. The thing many people have been asking for a long time.

What do you think?

-- 
Best regards, Viacheslav Chimishuk
vchimishuk at yandex.ru
Ukraine, Khmelnitsky

From felix.vanderjeugt at gmail.com Thu Nov 1 23:26:44 2018
From: felix.vanderjeugt at gmail.com (Felix Van der Jeugt)
Date: Thu, 01 Nov 2018 23:26:44 +0100
Subject: [qutebrowser] Extend ad blocking to allow single URL blocking.
In-Reply-To: <20181101210343.5zkziu5jfzimh752@t420.localdomain>
References: <20181101210343.5zkziu5jfzimh752@t420.localdomain>
Message-ID: <154111120397.18334.13631215283127282790@abysm>

Hi,

Sounds great. The Compiler is currently working on a plugin system,
which I believe would make this even easier to implement. If I recall
correctly, even the current hostblocking would be extracted to a plugin
so we could just extend that to include the functionality you
described.

Sincerely,
Felix
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: signature
URL:

From me at the-compiler.org Fri Nov 2 10:44:08 2018
From: me at the-compiler.org (Florian Bruhin)
Date: Fri, 2 Nov 2018 10:44:08 +0100
Subject: [qutebrowser] Extend ad blocking to allow single URL blocking.
In-Reply-To: <20181101210343.5zkziu5jfzimh752@t420.localdomain>
References: <20181101210343.5zkziu5jfzimh752@t420.localdomain>
Message-ID: <20181102094408.42zx7jynaa3jpwxs@hooch.localdomain>

Hey!

On Thu, Nov 01, 2018 at 11:03:43PM +0200, Viacheslav Chimishuk wrote:
> I'm thinking about extending existing ad blocking module to allow
> single URL blocking. For now we support two file formats:
> host-per-line and UNIX hosts file format. My idea is to add third
> format support: URL substring to block. For instance, if line is
> foo.com/bar (file format needs to be considered) every URL containing
> this substring will be blocked.
> Looks simple and backward compatible.
>
> Having this feature we can block ads on Youtube with a single
> line. The thing many people have been asking for a long time.
>
> What do you think?

What would the line to block Youtube ads look like?

FWIW I'm against introducing another URL pattern format, and I'd rather
use the existing URL pattern class which is also used for ":set -u"
(aka "per-domain settings"). See:
https://github.com/qutebrowser/qutebrowser/issues/4188

Note that I'm busy with university work on qutebrowser at least until
Christmas, so my time to look at PRs is quite limited currently:
https://lists.schokokeks.org/pipermail/qutebrowser-announce/2018-September/000051.html
https://lists.schokokeks.org/pipermail/qutebrowser-announce/2018-October/000053.html

Florian

-- 
https://www.qutebrowser.org | me at the-compiler.org (Mail/XMPP)
   GPG: 916E B0C8 FD55 A072 | https://the-compiler.org/pubkey.asc
         I love long mails! | https://email.is-not-s.ms/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL:

From vchimishuk at yandex.ru Fri Nov 2 19:28:30 2018
From: vchimishuk at yandex.ru (Viacheslav Chimishuk)
Date: Fri, 2 Nov 2018 20:28:30 +0200
Subject: [qutebrowser] Extend ad blocking to allow single URL blocking.
In-Reply-To: <20181102094408.42zx7jynaa3jpwxs@hooch.localdomain>
References: <20181101210343.5zkziu5jfzimh752@t420.localdomain>
 <20181102094408.42zx7jynaa3jpwxs@hooch.localdomain>
Message-ID: <20181102182830.atex64nw2hu22shb@t420.localdomain>

> What would the line to block Youtube ads look like?
>
> FWIW I'm against introducing another URL pattern format, and I'd rather
> use the existing URL pattern class which is also used for ":set -u"
> (aka "per-domain settings"). See:
> https://github.com/qutebrowser/qutebrowser/issues/4188

I was thinking more about simple substring matching, which is much
faster than regular expressions we use for options. And should be
enough for blocking. For example, I banned Youtube ads with next
simple code.

    u = url.toDisplayString()
    if 'youtube.com/get_midroll_' in u:
        return True

I propose to treat `~` prefix as a substring instead of domain
name. For example, next file parses to substrings list.

    $ cat ads-blocked.txt
    ~youtube.com/get_midroll_
    ~youtube.com/csi_204
    evil-site.com
    ~.google.com/pagead/

As you can see, you can mix it with regular domain names. This allows
user to maintain its own list or community can create and support such
file and serve it from GitHub or qutebrowser.org.
As a part of this change it would be great to support `file://`
protocol in `content.host_blocking.lists` list, so user can serve the
file from the disk without running web-server (if it is not
implemented already, haven't checked it).

> Note that I'm busy with university work on qutebrowser at least until
> Christmas, so my time to look at PRs is quite limited currently:
> https://lists.schokokeks.org/pipermail/qutebrowser-announce/2018-September/000051.html
> https://lists.schokokeks.org/pipermail/qutebrowser-announce/2018-October/000053.html

Yeah... Maybe other people can help you with that and accept PRs? :) I
can see Jay Kamat is very active in the project. In this case QB can
progress and grow faster.
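
For illustration only, the `~`-prefix list format proposed earlier in
this message could be parsed roughly as below. This is a minimal sketch,
not code from qutebrowser: parse_blocklist and is_blocked are
hypothetical names, and skipping "#" comment lines is an assumption
borrowed from hosts-file conventions.

```python
from urllib.parse import urlsplit


def parse_blocklist(lines):
    """Split list lines into exact hosts and `~`-prefixed URL substrings."""
    hosts = set()
    substrings = []
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        if line.startswith("~"):
            substrings.append(line[1:])
        else:
            hosts.add(line)
    return hosts, substrings


def is_blocked(url, hosts, substrings):
    """Block on an exact host match or on any substring of the full URL."""
    host = urlsplit(url).hostname or ""
    return host in hosts or any(s in url for s in substrings)


if __name__ == "__main__":
    hosts, substrings = parse_blocklist([
        "~youtube.com/get_midroll_",
        "~youtube.com/csi_204",
        "evil-site.com",
        "~.google.com/pagead/",
    ])
    print(is_blocked("https://www.youtube.com/get_midroll_info", hosts, substrings))
```

The host check is a set lookup, while the substring check scans every
entry, so its cost grows with the number of `~` lines.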

-- 
Best regards, Viacheslav Chimishuk
vchimishuk at yandex.ru
Ukraine, Khmelnitsky

From vchimishuk at yandex.ru Fri Nov 2 19:32:23 2018
From: vchimishuk at yandex.ru (Viacheslav Chimishuk)
Date: Fri, 2 Nov 2018 20:32:23 +0200
Subject: [qutebrowser] Extend ad blocking to allow single URL blocking.
In-Reply-To: <154111120397.18334.13631215283127282790@abysm>
References: <20181101210343.5zkziu5jfzimh752@t420.localdomain>
 <154111120397.18334.13631215283127282790@abysm>
Message-ID: <20181102183223.xalfr3dthmiw2eup@t420.localdomain>

> Sounds great. The Compiler is currently working on a plugin system,
> which I believe would make this even easier to implement. If I recall
> correctly, even the current hostblocking would be extracted to a plugin
> so we could just extend that to include the functionality you
> described.

Yeah, it is a relatively simple change. Afterwards, the whole adblock.py
can be migrated to a plugin framework. This is the plan.

-- 
Best regards, Viacheslav Chimishuk
vchimishuk at yandex.ru
Ukraine, Khmelnitsky

From vchimishuk at yandex.ru Fri Nov 2 19:54:58 2018
From: vchimishuk at yandex.ru (Viacheslav Chimishuk)
Date: Fri, 2 Nov 2018 20:54:58 +0200
Subject: [qutebrowser] Extend ad blocking to allow single URL blocking.
In-Reply-To: <20181102182830.atex64nw2hu22shb@t420.localdomain>
References: <20181101210343.5zkziu5jfzimh752@t420.localdomain>
 <20181102094408.42zx7jynaa3jpwxs@hooch.localdomain>
 <20181102182830.atex64nw2hu22shb@t420.localdomain>
Message-ID: <20181102185458.ofjtip36p5vmw2fu@t420.localdomain>

I can see ublock uses regexp for its list. Implementing regexp support
we can reuse their files, which will get us a huge
benefit. Unfortunately their format is not compatible with UrlPattern
we use and looks much more complicated than a simple regexp. IMHO, it
is better to start with simple substring match for now.

https://github.com/gorhill/uBlock/wiki/Static-filter-syntax

On 02.11.2018 20:28, Viacheslav Chimishuk wrote:
> > What would the line to block Youtube ads look like?
> >
> > FWIW I'm against introducing another URL pattern format, and I'd rather
> > use the existing URL pattern class which is also used for ":set -u"
> > (aka "per-domain settings"). See:
> > https://github.com/qutebrowser/qutebrowser/issues/4188
>
> I was thinking more about simple substring matching, which is much
> faster than regular expressions we use for options. And should be
> enough for blocking. For example, I banned Youtube ads with next
> simple code.
>
>     u = url.toDisplayString()
>     if 'youtube.com/get_midroll_' in u:
>         return True
>
> I propose to treat `~` prefix as a substring instead of domain
> name. For example, next file parses to substrings list.
>
>     $ cat ads-blocked.txt
>     ~youtube.com/get_midroll_
>     ~youtube.com/csi_204
>     evil-site.com
>     ~.google.com/pagead/
>
> As you can see, you can mix it with regular domain names. This allows
> user to maintain its own list or community can create and support such
> file and serve it from GitHub or qutebrowser.org.
> As a part of this change it would be great to support `file://`
> protocol in `content.host_blocking.lists` list, so user can serve the
> file from the disk without running web-server (if it is not
> implemented already, haven't checked it).
>
>
> > Note that I'm busy with university work on qutebrowser at least until
> > Christmas, so my time to look at PRs is quite limited currently:
> > https://lists.schokokeks.org/pipermail/qutebrowser-announce/2018-September/000051.html
> > https://lists.schokokeks.org/pipermail/qutebrowser-announce/2018-October/000053.html
>
> Yeah... Maybe other people can help you with that and accept PRs? :) I
> can see Jay Kamat is very active in the project. In this case QB can
> progress and grow faster.
>
> -- 
> Best regards, Viacheslav Chimishuk
> vchimishuk at yandex.ru
> Ukraine, Khmelnitsky

-- 
Best regards, Viacheslav Chimishuk
vchimishuk at yandex.ru
Ukraine, Khmelnitsky

From me at the-compiler.org Fri Nov 2 21:47:24 2018
From: me at the-compiler.org (Florian Bruhin)
Date: Fri, 2 Nov 2018 21:47:24 +0100
Subject: [qutebrowser] Extend ad blocking to allow single URL blocking.
In-Reply-To: <20181102182830.atex64nw2hu22shb@t420.localdomain>
 <20181102185458.ofjtip36p5vmw2fu@t420.localdomain>
Message-ID: <20181102204724.4lpeznh3ho77zxta@hooch.localdomain>

On Fri, Nov 02, 2018 at 08:28:30PM +0200, Viacheslav Chimishuk wrote:
> > What would the line to block Youtube ads look like?
> >
> > FWIW I'm against introducing another URL pattern format, and I'd rather
> > use the existing URL pattern class which is also used for ":set -u"
> > (aka "per-domain settings"). See:
> > https://github.com/qutebrowser/qutebrowser/issues/4188
>
> I was thinking more about simple substring matching, which is much
> faster than regular expressions we use for options. And should be
> enough for blocking.

Options/URL patterns don't use regular expressions.

> I propose to treat `~` prefix as a substring instead of domain
> name. For example, next file parses to substrings list.

Again, use URL patterns. I will not accept any contributions inventing
yet another URL format, no matter how simple. Let's be consistent with
what the rest of qutebrowser uses. With URL patterns, I think they
should even be backwards-compatible with the "one host per line" format.

> As a part of this change it would be great to support `file://`
> protocol in `content.host_blocking.lists` list, so user can serve the
> file from the disk without running web-server (if it is not
> implemented already, haven't checked it).

It already is.

> > Note that I'm busy with university work on qutebrowser at least until
> > Christmas, so my time to look at PRs is quite limited currently:
> > https://lists.schokokeks.org/pipermail/qutebrowser-announce/2018-September/000051.html
> > https://lists.schokokeks.org/pipermail/qutebrowser-announce/2018-October/000053.html
>
> Yeah... Maybe other people can help you with that and accept PRs? :) I
> can see Jay Kamat is very active in the project. In this case QB can
> progress and grow faster.

That's already the case, but it won't change that there'll be conflicts
when I'm refactoring things (like moving the adblocker to a plugin)
while people are contributing stuff. I won't be able to accommodate
that.

On Fri, Nov 02, 2018 at 08:54:58PM +0200, Viacheslav Chimishuk wrote:
> I can see ublock uses regexp for its list. Implementing regexp support
> we can reuse their files, which will get us a huge
> benefit. Unfortunately their format is not compatible with UrlPattern
> we use and looks much more complicated than a simple regexp. IMHO, it
> is better to start with simple substring match for now.
>
> https://github.com/gorhill/uBlock/wiki/Static-filter-syntax

Yeah, the existing uBlock/ABP syntax would be great to support:
https://github.com/qutebrowser/qutebrowser/issues/29

I agree that's probably a bigger undertaking though.

Florian

-- 
https://www.qutebrowser.org | me at the-compiler.org (Mail/XMPP)
   GPG: 916E B0C8 FD55 A072 | https://the-compiler.org/pubkey.asc
         I love long mails! | https://email.is-not-s.ms/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL:

From jaygkamat at gmail.com Sat Nov 3 18:24:24 2018
From: jaygkamat at gmail.com (Jay Kamat)
Date: Sat, 03 Nov 2018 10:24:24 -0700
Subject: [qutebrowser] Extend ad blocking to allow single URL blocking.
In-Reply-To: <20181102182830.atex64nw2hu22shb@t420.localdomain>
References: <20181101210343.5zkziu5jfzimh752@t420.localdomain>
 <20181102094408.42zx7jynaa3jpwxs@hooch.localdomain>
 <20181102182830.atex64nw2hu22shb@t420.localdomain>
Message-ID: <87va5edqgn.fsf@gmail.com>

Viacheslav Chimishuk writes:

>> What would the line to block Youtube ads look like?
>>
>> FWIW I'm against introducing another URL pattern format, and I'd rather
>> use the existing URL pattern class which is also used for ":set -u"
>> (aka "per-domain settings"). See:
>> https://github.com/qutebrowser/qutebrowser/issues/4188
>
> I was thinking more about simple substring matching, which is much
> faster than regular expressions we use for options. And should be
> enough for blocking. For example, I banned Youtube ads with next
> simple code.
>
>     u = url.toDisplayString()
>     if 'youtube.com/get_midroll_' in u:
>         return True
>
> I propose to treat `~` prefix as a substring instead of domain
> name. For example, next file parses to substrings list.
>
>     $ cat ads-blocked.txt
>     ~youtube.com/get_midroll_
>     ~youtube.com/csi_204
>     evil-site.com
>     ~.google.com/pagead/

FWIW, (as The-Compiler already said) the existing URL match patterns should
work in this way too. I think these would be the translations of the examples:

    *.youtube.com/get_midroll_/*
    *.youtube.com/csi_204/*
    *.evil-site.com/*
    *.google.com/pagead/*

I definitely agree that we should only have one match pattern, unless we are
adopting some other format, like ABP or uBlock. I'm not sure if there's a single
standard for block lists though, it looks like even ABP and uBlock have slightly
different standards. From a quick look, it looks like they are more complicated
than just blocking urls, they also have support for hiding individual elements
on a page.

> Yeah... Maybe other people can help you with that and accept PRs? :) I
> can see Jay Kamat is very active in the project. In this case QB can
> progress and grow faster.

FWIW, my reviews are (in general) worse than The-Compiler's reviews, as after my
reviews are done, he usually catches some important things I missed. Because of
that, I try to avoid actually merging things unless other people have also
reviewed and tested it (which hasn't happened too much).

There are a couple areas (such as tab sizing and pinned tabs) which I would feel
comfortable with managing, but The-Compiler still has a much better overall
picture of everything :).

From vchimishuk at yandex.ru Sun Nov 4 18:16:45 2018
From: vchimishuk at yandex.ru (Viacheslav Chimishuk)
Date: Sun, 4 Nov 2018 19:16:45 +0200
Subject: [qutebrowser] Extend ad blocking to allow single URL blocking.
In-Reply-To: <20181102204724.4lpeznh3ho77zxta@hooch.localdomain>
References: <20181102182830.atex64nw2hu22shb@t420.localdomain>
 <20181102185458.ofjtip36p5vmw2fu@t420.localdomain>
 <20181102204724.4lpeznh3ho77zxta@hooch.localdomain>
Message-ID: <20181104171645.e5trglgvqarxspyy@t420.localdomain>

> > I propose to treat `~` prefix as a substring instead of domain
> > name. For example, next file parses to substrings list.
>
> Again, use URL patterns. I will not accept any contributions inventing
> yet another URL format, no matter how simple. Let's be consistent with
> what the rest of qutebrowser uses. With URL patterns, I think they
> should even be backwards-compatible with the "one host per line" format.

I guess implementing Chrome's URL pattern syntax here can be harder to
implement. Ok, I'll try to implement it the way you proposed.

Simple domain name is not compatible with pattern syntax -- it
requires scheme and path parts.
In this case we need to treat foo.com as *://foo.com/*

> > > Note that I'm busy with university work on qutebrowser at least until
> > > Christmas, so my time to look at PRs is quite limited currently:
> > > https://lists.schokokeks.org/pipermail/qutebrowser-announce/2018-September/000051.html
> > > https://lists.schokokeks.org/pipermail/qutebrowser-announce/2018-October/000053.html
> >
> > Yeah... Maybe other people can help you with that and accept PRs? :) I
> > can see Jay Kamat is very active in the project. In this case QB can
> > progress and grow faster.
>
> That's already the case, but it won't change that there'll be conflicts
> when I'm refactoring things (like moving the adblocker to a plugin)
> while people are contributing stuff. I won't be able to accommodate
> that.

I thought that adblock will be moved outside the core later, after
plugins support will be finished. Like you introduce plugin support
and afterwards extract existing core parts (which you want to be a plugin)
one by one.

-- 
Best regards, Viacheslav Chimishuk
vchimishuk at yandex.ru
Ukraine, Khmelnitsky

From me at the-compiler.org Sun Nov 4 18:23:23 2018
From: me at the-compiler.org (Florian Bruhin)
Date: Sun, 4 Nov 2018 18:23:23 +0100
Subject: [qutebrowser] Extend ad blocking to allow single URL blocking.
In-Reply-To: <20181104171645.e5trglgvqarxspyy@t420.localdomain>
References: <20181102182830.atex64nw2hu22shb@t420.localdomain>
 <20181102185458.ofjtip36p5vmw2fu@t420.localdomain>
 <20181102204724.4lpeznh3ho77zxta@hooch.localdomain>
 <20181104171645.e5trglgvqarxspyy@t420.localdomain>
Message-ID: <20181104172323.mq2nnbg5msahdipm@hooch.localdomain>

On Sun, Nov 04, 2018 at 07:16:45PM +0200, Viacheslav Chimishuk wrote:
> > > I propose to treat `~` prefix as a substring instead of domain
> > > name. For example, next file parses to substrings list.
> >
> > Again, use URL patterns. I will not accept any contributions inventing
> > yet another URL format, no matter how simple.
> > Let's be consistent with
> > what the rest of qutebrowser uses. With URL patterns, I think they
> > should even be backwards-compatible with the "one host per line" format.
>
> I guess implementing Chrome's URL pattern syntax here can be harder to
> implement. Ok, I'll try to implement it the way you proposed.

It's as simple as "urlmatch.UrlPattern(pattern).matches(url)".

> Simple domain name is not compatible with pattern syntax -- it
> requires scheme and path parts. In this case we need to treat foo.com
> as *://foo.com/*

qutebrowser's UrlPattern lifts that restriction, so you can also easily
do ":set -u foo.com content.javascript.enabled true" for example. It
assumes * for empty parts by default.

> > > > Note that I'm busy with university work on qutebrowser at least until
> > > > Christmas, so my time to look at PRs is quite limited currently:
> > > > https://lists.schokokeks.org/pipermail/qutebrowser-announce/2018-September/000051.html
> > > > https://lists.schokokeks.org/pipermail/qutebrowser-announce/2018-October/000053.html
> > >
> > > Yeah... Maybe other people can help you with that and accept PRs? :) I
> > > can see Jay Kamat is very active in the project. In this case QB can
> > > progress and grow faster.
> >
> > That's already the case, but it won't change that there'll be conflicts
> > when I'm refactoring things (like moving the adblocker to a plugin)
> > while people are contributing stuff. I won't be able to accommodate
> > that.
>
> I thought that adblock will be moved outside the core later, after
> plugins support will be finished. Like you introduce plugin support
> and afterwards extract existing core parts (which you want to be a plugin)
> one by one.

Extracting core code is part of the university project work. It'll be
done before the plugin API gets opened for external third-party
plugins, so it sees some actual usage first.
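
To make the "assumes * for empty parts" behaviour above concrete, here
is a rough stand-alone sketch. It is not qutebrowser's
urlmatch.UrlPattern and covers only a small slice of Chrome's match
pattern syntax; parse_pattern, matches and any_match are hypothetical
helpers written for this illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def parse_pattern(pattern):
    """Split a pattern into (scheme, host, path); missing parts become '*'."""
    scheme, sep, rest = pattern.partition("://")
    if not sep:  # no scheme given, e.g. "foo.com"
        scheme, rest = "*", pattern
    host, slash, path = rest.partition("/")
    return scheme or "*", host or "*", ("/" + path) if slash else "/*"


def matches(pattern, url):
    """Toy matcher: compares scheme, host and path separately."""
    scheme, host, path = parse_pattern(pattern)
    parts = urlsplit(url)
    url_host = parts.hostname or ""
    if scheme != "*" and scheme != parts.scheme:
        return False
    if host.startswith("*."):
        # "*.example.com" matches example.com and any subdomain of it.
        base = host[2:]
        if url_host != base and not url_host.endswith("." + base):
            return False
    elif host != "*" and host != url_host:
        return False
    # Simplification: fnmatch also gives '?' and '[' a special meaning.
    return fnmatchcase(parts.path or "/", path)


def any_match(patterns, url):
    """Naive lookup: try every pattern in turn, O(n) in the list size."""
    return any(matches(p, url) for p in patterns)
```

With this sketch, matches("foo.com", "https://foo.com/bar") is true
because the bare host is read as *://foo.com/*.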

Florian

-- 
https://www.qutebrowser.org | me at the-compiler.org (Mail/XMPP)
   GPG: 916E B0C8 FD55 A072 | https://the-compiler.org/pubkey.asc
         I love long mails! | https://email.is-not-s.ms/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL:

From me at the-compiler.org Mon Nov 5 10:17:50 2018
From: me at the-compiler.org (Florian Bruhin)
Date: Mon, 5 Nov 2018 10:17:50 +0100
Subject: [qutebrowser] Extend ad blocking to allow single URL blocking.
In-Reply-To: <20181104191923.t7rq7d34xzmwvbcb@t420.localdomain>
References: <20181102182830.atex64nw2hu22shb@t420.localdomain>
 <20181102185458.ofjtip36p5vmw2fu@t420.localdomain>
 <20181102204724.4lpeznh3ho77zxta@hooch.localdomain>
 <20181104171645.e5trglgvqarxspyy@t420.localdomain>
 <20181104172323.mq2nnbg5msahdipm@hooch.localdomain>
 <20181104191923.t7rq7d34xzmwvbcb@t420.localdomain>
Message-ID: <20181105091750.pps5czomktmu3rop@hooch.localdomain>

Hey,

(re-adding the list to the Cc, I assume you accidentally didn't include
it in your reply)

On Sun, Nov 04, 2018 at 09:19:23PM +0200, Viacheslav Chimishuk wrote:
> > On Sun, Nov 04, 2018 at 07:16:45PM +0200, Viacheslav Chimishuk wrote:
> > > > > I propose to treat `~` prefix as a substring instead of domain
> > > > > name. For example, next file parses to substrings list.
> > > >
> > > > Again, use URL patterns. I will not accept any contributions inventing
> > > > yet another URL format, no matter how simple. Let's be consistent with
> > > > what the rest of qutebrowser uses. With URL patterns, I think they
> > > > should even be backwards-compatible with the "one host per line" format.
> > >
> > > I guess implementing Chrome's URL pattern syntax here can be harder to
> > > implement. Ok, I'll try to implement it the way you proposed.
> >
> > It's as simple as "urlmatch.UrlPattern(pattern).matches(url)".
>
> The only thing I'm not sure about is a simple loop calling it for
> every pattern. In case of thousands of patterns it should be
> very slow. I was thinking more about compiling patterns to some kind
> of tree, so match complexity for a single URL will be log(n) instead
> of n.
> Looks like you are fine if we can start from naive looping
> implementation first. Aren't you?

That's true, and a valid concern. Currently we can just test for
membership of a host in a set, which is O(1). Host blocking lookups are
done quite a lot, and there are over 50k blocked hosts with the default
blocklist, so having that O(n) instead of O(1) is probably a noticeable
performance difference.

I added a benchmark test for adblock lookups:
https://github.com/qutebrowser/qutebrowser/commit/27d4796c2f9ced450ceaf3569de228532ce08df7

Currently looking up whether a host is blocked takes 3us (median).
When changing the lookup to:

    any(h == host for h in self._blocked_hosts)

instead of

    host in self._blocked_hosts

it instead takes 5.6ms, so almost 2000 times slower. Seeing that there
can be hundreds of adblock lookups for a page load, that's not
acceptable.


Like you say, the best way out would be having some kind of
UrlPatternSet class which can be more intelligent about lookups. I've
had some thoughts on that noted down somewhere already, but I can't find
them right now. Basically, you could hash patterns by their hosts, and
then if you want to check whether foobar.example.com is contained, make
some O(1) lookups for:

- foobar.example.com
- *.example.com
- *.com
- *

However, ensuring that this works efficiently for all imaginable cases
probably requires some more thought.

I'd instead propose the following:

- Add a UrlPatternSet class which we can extend like explained above
  over time.
- For now, just add a special case for patterns in the form of
  *://example.com/* (all schemes, constant host, all paths), by looking
  them up in a Python set.
  For all patterns which are more complex than that, fall back to
  iterating.

That solution should be quite straightforward for now, and result in a
huge speedup for the common case of adblocking lists.

What do you think?

Florian

-- 
https://www.qutebrowser.org | me at the-compiler.org (Mail/XMPP)
   GPG: 916E B0C8 FD55 A072 | https://the-compiler.org/pubkey.asc
         I love long mails! | https://email.is-not-s.ms/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL:

From vchimishuk at yandex.ru Mon Nov 5 19:12:12 2018
From: vchimishuk at yandex.ru (Viacheslav Chimishuk)
Date: Mon, 5 Nov 2018 20:12:12 +0200
Subject: [qutebrowser] Extend ad blocking to allow single URL blocking.
In-Reply-To: <20181105091750.pps5czomktmu3rop@hooch.localdomain>
References: <20181102182830.atex64nw2hu22shb@t420.localdomain>
 <20181102185458.ofjtip36p5vmw2fu@t420.localdomain>
 <20181102204724.4lpeznh3ho77zxta@hooch.localdomain>
 <20181104171645.e5trglgvqarxspyy@t420.localdomain>
 <20181104172323.mq2nnbg5msahdipm@hooch.localdomain>
 <20181104191923.t7rq7d34xzmwvbcb@t420.localdomain>
 <20181105091750.pps5czomktmu3rop@hooch.localdomain>
Message-ID: <20181105181212.l54dflxbict7v7gf@t420.localdomain>

> Hey,
>
> (re-adding the list to the Cc, I assume you accidentally didn't include
> it in your reply)

Hi, Florian.

Yes. Sorry, my fault. Thanks for adding it back.

> That's true, and a valid concern. Currently we can just test for
> membership of a host in a set, which is O(1). Host blocking lookups are
> done quite a lot, and there are over 50k blocked hosts with the default
> blocklist, so having that O(n) instead of O(1) is probably a noticeable
> performance difference.
>
> I added a benchmark test for adblock lookups:
> https://github.com/qutebrowser/qutebrowser/commit/27d4796c2f9ced450ceaf3569de228532ce08df7
>
> Currently looking up whether a host is blocked takes 3us (median).
> When changing the lookup to:
>
>     any(h == host for h in self._blocked_hosts)
>
> instead of
>
>     host in self._blocked_hosts
>
> it instead takes 5.6ms, so almost 2000 times slower. Seeing that there
> can be hundreds of adblock lookups for a page load, that's not
> acceptable.
>
>
> Like you say, the best way out would be having some kind of
> UrlPatternSet class which can be more intelligent about lookups. I've
> had some thoughts on that noted down somewhere already, but I can't find
> them right now. Basically, you could hash patterns by their hosts, and
> then if you want to check whether foobar.example.com is contained, make
> some O(1) lookups for:
>
> - foobar.example.com
> - *.example.com
> - *.com
> - *
>
> However, ensuring that this works efficiently for all imaginable cases
> probably requires some more thought.
>
> I'd instead propose the following:
>
> - Add a UrlPatternSet class which we can extend like explained above
>   over time.
> - For now, just add a special case for patterns in the form of
>   *://example.com/* (all schemes, constant host, all paths), by looking
>   them up in a Python set. For all patterns which are more complex than
>   that, fall back to iterating.
>
> That solution should be quite straightforward for now, and result in a
> huge speedup for the common case of adblocking lists.
>
> What do you think?

Sounds like an excellent plan. I'll implement it the way you
described. After that I'll try to investigate and find a better way to
speed it up.

-- 
Best regards, Viacheslav Chimishuk
vchimishuk at yandex.ru
Ukraine, Khmelnitsky

From tgy at inria.fr Fri Nov 9 10:35:59 2018
From: tgy at inria.fr (Valentin Iovene)
Date: Fri, 9 Nov 2018 10:35:59 +0100
Subject: [qutebrowser] Save PDF file opened with pdfjs
Message-ID: <20181109093559.lhi56kvv74cefwgh@valoup>

Hi,

I'm using pdf.js to open PDF files directly in qutebrowser.
My ~/.qutebrowser/config.py thus contains this line to enable it:

    c.content.pdfjs = True

Sometimes I however want to save the PDF. There is a Download button
embedded in pdf.js (top right corner) but it doesn't do anything when I
click it. Maybe this is a bug?

My qute://version

    qutebrowser v1.5.1
    Git commit: c7a21b37e-dirty (2018-10-10 08:23:17 +0200)
    Backend: QtWebEngine (Chromium 65.0.3325.230)
    CPython: 3.7.0
    Qt: 5.11.2
    PyQt: 5.11.3
    Style: QMacStyle
    Platform: Darwin-18.0.0-x86_64-i386-64bit, 64bit
    Frozen: True
    Imported from /Applications/qutebrowser.app/Contents/MacOS/qutebrowser
    Using Python from /Applications/qutebrowser.app/Contents/MacOS/qutebrowser
    Qt library executable path: /Applications/qutebrowser.app/Contents/MacOS/PyQt5/Qt/libexec, data path: /Applications/qutebrowser.app/Contents/MacOS/PyQt5/Qt
    OS Version: 10.14, x86_64

My current process is to disable pdfjs, download the PDF by re-opening
the initial PDF URL, then re-enable pdfjs. But this is cumbersome.

Also, where are temporary PDF files located? This could be an
alternative solution to just cp the PDF from a terminal.

Thanks for your help,

-- 
Valentin

From me at the-compiler.org Fri Nov 9 10:42:44 2018
From: me at the-compiler.org (Florian Bruhin)
Date: Fri, 9 Nov 2018 10:42:44 +0100
Subject: [qutebrowser] Save PDF file opened with pdfjs
In-Reply-To: <20181109093559.lhi56kvv74cefwgh@valoup>
References: <20181109093559.lhi56kvv74cefwgh@valoup>
Message-ID: <20181109094244.jgpkvyuhcpobk4xq@hooch.localdomain>

Hey Valentin,

On Fri, Nov 09, 2018 at 10:35:59AM +0100, Valentin Iovene wrote:
> Sometimes I however want to save the PDF. There is a Download button
> embedded in pdf.js (top right corner) but it doesn't do anything when I
> click it. Maybe this is a bug?

It's complicated - in short, it's a PDF.js bug (or feature - it silently
ignores downloads not coming from a http/https URL) which triggers
because of a workaround for a Qt bug.
It should be possible to disable that workaround
with Qt 5.12, where the download button should start working.

> My current process is to disable pdfjs, download the PDF by re-opening
> the initial PDF URL, then re-enable pdfjs. But this is cumbersome.

You can instead disable content.pdfjs and hit Ctrl-P
(:prompt-open-download --pdfjs) in the download prompt.

> Also, where are temporary PDF files located? This could be an
> alternative solution to just cp the PDF from a terminal.

In a qutebrowser-download-... directory in your OS' temporary
directory. Don't remember where that was on macOS, but you can do
:debug-console and this:

    from qutebrowser.browser import downloads
    downloads.temp_download_manager.get_tmpdir().name

Florian

-- 
https://www.qutebrowser.org | me at the-compiler.org (Mail/XMPP)
   GPG: 916E B0C8 FD55 A072 | https://the-compiler.org/pubkey.asc
         I love long mails! | https://email.is-not-s.ms/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL:

From tgy at inria.fr Fri Nov 9 10:57:04 2018
From: tgy at inria.fr (Valentin Iovene)
Date: Fri, 9 Nov 2018 10:57:04 +0100
Subject: [qutebrowser] Save PDF file opened with pdfjs
In-Reply-To: <20181109094244.jgpkvyuhcpobk4xq@hooch.localdomain>
References: <20181109093559.lhi56kvv74cefwgh@valoup>
 <20181109094244.jgpkvyuhcpobk4xq@hooch.localdomain>
Message-ID: <20181109095704.ghbtcvjx5u4m7xmu@valoup>

Hey, thanks for your quick reply.

On 09.11.2018 10:42, Florian Bruhin wrote:
> You can instead disable content.pdfjs and hit Ctrl-P
> (:prompt-open-download --pdfjs) in the download prompt.

Is there a way I can make a keyboard shortcut that
1. disables pdfjs
2. downloads the file in ~/Downloads without asking
3. re-enables pdfjs?

Or maybe I'll just wait for Qt 5.12!

> In a qutebrowser-download-... directory in your OS' temporary
> directory.
> Don't remember where that was on macOS, but you can do
> :debug-console and this:
>
>     from qutebrowser.browser import downloads
>     downloads.temp_download_manager.get_tmpdir().name

This gave me

    /var/folders/d9/c026wqfd69d4v4__gg3dfn0h0000gq/T/qutebrowser-downloads-0q1p1npg

No wonder I couldn't find it haha!!

Thanks Florian :)

-- 
Valentin

From me at the-compiler.org Fri Nov 9 11:11:08 2018
From: me at the-compiler.org (Florian Bruhin)
Date: Fri, 9 Nov 2018 11:11:08 +0100
Subject: [qutebrowser] Save PDF file opened with pdfjs
In-Reply-To: <20181109095704.ghbtcvjx5u4m7xmu@valoup>
References: <20181109093559.lhi56kvv74cefwgh@valoup>
 <20181109094244.jgpkvyuhcpobk4xq@hooch.localdomain>
 <20181109095704.ghbtcvjx5u4m7xmu@valoup>
Message-ID: <20181109101108.pnwhdlpzwnequoy2@hooch.localdomain>

On Fri, Nov 09, 2018 at 10:57:04AM +0100, Valentin Iovene wrote:
> Hey, thanks for your quick reply.
>
> On 09.11.2018 10:42, Florian Bruhin wrote:
> > You can instead disable content.pdfjs and hit Ctrl-P
> > (:prompt-open-download --pdfjs) in the download prompt.
>
> Is there a way I can make a keyboard shortcut that
> 1. disables pdfjs
> 2. downloads the file in ~/Downloads without asking
> 3. re-enables pdfjs?

Don't think so, as you can't get the original URL easily when PDF.js is
already open.

> > In a qutebrowser-download-... directory in your OS' temporary
> > directory. Don't remember where that was on macOS, but you can do
> > :debug-console and this:
> >
> >     from qutebrowser.browser import downloads
> >     downloads.temp_download_manager.get_tmpdir().name
>
> This gave me
>
>     /var/folders/d9/c026wqfd69d4v4__gg3dfn0h0000gq/T/qutebrowser-downloads-0q1p1npg
>
> No wonder I couldn't find it haha!!

That /var/folders/.../ stuff seems to be the macOS equivalent to what
/tmp is on Linux ;)

Florian

-- 
https://www.qutebrowser.org | me at the-compiler.org (Mail/XMPP)
   GPG: 916E B0C8 FD55 A072 | https://the-compiler.org/pubkey.asc
I love long mails!
| https://email.is-not-s.ms/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL:

From tgy at inria.fr Fri Nov 9 11:32:39 2018
From: tgy at inria.fr (Valentin Iovene)
Date: Fri, 9 Nov 2018 11:32:39 +0100
Subject: [qutebrowser] Save PDF file opened with pdfjs
In-Reply-To: <20181109101108.pnwhdlpzwnequoy2@hooch.localdomain>
References: <20181109093559.lhi56kvv74cefwgh@valoup>
 <20181109094244.jgpkvyuhcpobk4xq@hooch.localdomain>
 <20181109095704.ghbtcvjx5u4m7xmu@valoup>
 <20181109101108.pnwhdlpzwnequoy2@hooch.localdomain>
Message-ID: <20181109103239.eftv53ln6zcqv4tu@valoup>

On 09.11.2018 11:11, Florian Bruhin wrote:
> That /var/folders/.../ stuff seems to be the macOS equivalent to what
> /tmp is on Linux ;)

Hmm, don't think so. In macOS, there is a /tmp symlink, pointing to
a /private/tmp folder, which is the actual equivalent of tmp I believe.

-- 
Valentin
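
For anyone hunting for that directory from a script rather than from
:debug-console, a small sketch along these lines may help. It assumes
the temporary download directory is created under the location reported
by Python's tempfile.gettempdir() (which honours the TMPDIR environment
variable) and that its name starts with "qutebrowser-download", as in
the path shown earlier in this thread. find_download_dirs is a
hypothetical helper, not part of qutebrowser.

```python
import glob
import os
import tempfile


def find_download_dirs(tmp_root=None):
    """Return temp directories whose name starts with 'qutebrowser-download'."""
    root = tmp_root or tempfile.gettempdir()
    # glob.escape guards against glob metacharacters in the root path.
    pattern = os.path.join(glob.escape(root), "qutebrowser-download*")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


if __name__ == "__main__":
    print("temp root:", tempfile.gettempdir())
    for path in find_download_dirs():
        print(path)
```

Printing tempfile.gettempdir() alongside the matches also shows which
temporary root the current environment actually resolves to.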