From c9a9ccf8a35e157e22afeaafc2851176ddd87e68 Mon Sep 17 00:00:00 2001
From: Glenn Slayden <5589855+glenn-slayden@users.noreply.github.com>
Date: Sun, 4 Oct 2020 22:54:59 -0700
Subject: [PATCH] URL batch listing file improvements

These improvements apply to reading the list of URLs from the file supplied via the `--batch-file` (`-a`) command line option.

1. Skip blank and empty lines in the file. Currently, lines with leading whitespace are only skipped when that whitespace is followed by a comment character (`#`, `;`, or `]`). This means that empty lines and lines consisting only of whitespace are returned as (trimmed) empty strings in the list of URLs to process.

2. [bug fix] Detect and remove the Unicode BOM when the file descriptor is already decoding Unicode.

With Python 3, the `batch_fd` enumerator returns the lines of the file as Unicode. For UTF-8, this means that the raw BOM bytes from the file `\xef \xbb \xbf` show up converted into a single `\ufeff` character prefixed to the first enumerated text line.

This fix resolves several buggy interactions between the presence of a BOM, the skipping of comments and blank lines, and the trimming of the URL list. For example, if the first line of the file is blank, the BOM is incorrectly returned as a URL on its own. If the first line contains a URL, that URL is prefixed with the unwanted `\ufeff` character, whose presence also prevents any leading whitespace from being trimmed. Currently, the `UnicodeBOMIE` helper attempts to recover from some of these error cases, but this fix prevents the error from happening in the first place (at least on Python 3). In any case, the `UnicodeBOMIE` approach is flawed, because it makes no sense for a BOM to appear in the (non-batch) URLs specified directly on the command line or, for that matter, in URLs *after the first line* of a batch list.
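
The first-line case can be reproduced in isolation. The snippet below is only an illustrative sketch for Python 3: the sample URL and the variable names are made up, and it mimics the pre-patch checks rather than calling `read_batch_urls`.

```python
import codecs

# Hypothetical first line of a batch file: UTF-8 BOM, two spaces, then a URL.
raw = codecs.BOM_UTF8 + b'  https://example.com/video\n'

# Decoding as plain 'utf-8' keeps the BOM as a single U+FEFF character.
first_line = raw.decode('utf-8')
assert first_line[0] == '\ufeff'

# The pre-patch check compares against the three raw byte values, so it misses U+FEFF.
assert not first_line.startswith('\xef\xbb\xbf')

# strip() does not treat U+FEFF as whitespace, so the BOM shields the leading spaces.
old_result = first_line.strip()
assert old_result == '\ufeff  https://example.com/video'
```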

3. Now that `read_batch_urls` consistently yields only properly trimmed URLs, it can also eliminate exact duplicates on the fly, without disturbing the order in which they are listed.
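
Taken together, the three points amount to roughly the following filtering. This is a simplified standalone sketch, not the patched function: `filter_batch_lines` is a hypothetical name, the input is assumed to be already-decoded text lines, and the patch's byte decoding and its handling of text after a `#` are left out.

```python
def filter_batch_lines(lines):
    """Hypothetical sketch: drop BOM, blanks, comments and duplicates, keeping order."""
    seen = set()
    urls = []
    for line in lines:
        # Remove a decoded BOM before trimming, so it cannot shield whitespace.
        if line.startswith('\ufeff'):
            line = line[1:]
        line = line.strip()
        # Skip blank lines and lines that start with a comment character.
        if not line or line[0] in ('#', ';', ']'):
            continue
        # Keep only the first occurrence of each URL.
        if line not in seen:
            seen.add(line)
            urls.append(line)
    return urls
```

A set gives constant-time membership checks, while appending to a separate list keeps the URLs in the order they first appear in the file.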
---
 youtube_dl/utils.py | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/youtube_dl/utils.py b/youtube_dl/utils.py
index 01d9c0362..608586b77 100644
--- a/youtube_dl/utils.py
+++ b/youtube_dl/utils.py
@@ -3879,16 +3879,23 @@ def escape_url(url):
 
 
 def read_batch_urls(batch_fd):
+    seen = set()
     def fixup(url):
         if not isinstance(url, compat_str):
             url = url.decode('utf-8', 'replace')
         BOM_UTF8 = '\xef\xbb\xbf'
         if url.startswith(BOM_UTF8):
             url = url[len(BOM_UTF8):]
-        url = url.strip()
-        if url.startswith(('#', ';', ']')):
-            return False
-        return url
+        if url:
+            if url[0] == '\ufeff':
+                url = url[1:]
+            url = url.lstrip()
+            if url and url[0] not in ('#', ';', ']'):
+                url = url.split('#', 1)[0].rstrip()
+                if url not in seen:
+                    seen.add(url)
+                    return url
+        return False
 
     with contextlib.closing(batch_fd) as fd:
         return [url for url in map(fixup, fd) if url]