628 Commits (c9a9ccf8a35e157e22afeaafc2851176ddd87e68)

Author SHA1 Message Date
Glenn Slayden c9a9ccf8a3
URL batch listing file improvements
These improvements apply to reading the list of URLs from the file supplied via the `--batch-file` (`-a`) command line option.

1. Skip blank and empty lines in the file. Currently, lines with leading whitespace are only skipped when that whitespace is followed by a comment character (`#`, `;`, or `]`). This means that empty lines and lines consisting only of whitespace are returned as (trimmed) empty strings in the list of URLs to process.

2. [bug fix] Detect and remove the Unicode BOM when the file descriptor is already decoding Unicode.

With Python 3, the `batch_fd` enumerator returns the lines of the file as Unicode. For UTF-8, this means that the raw BOM bytes from the file `\xef \xbb \xbf` show up converted into a single `\ufeff` character prefixed to the first enumerated text line. 

This fix solves several buggy interactions between the presence of a BOM, the skipping of comments and/or blank lines, and ensuring the list of URLs is consistently trimmed. For example, if the first line of the file is blank, the BOM is incorrectly returned as a URL standing alone. If the first line contains a URL, it will be prefixed with this unwanted single character, and the presence of that character will also have inhibited the proper trimming of any leading whitespace. Currently, the `UnicodeBOMIE` helper attempts to recover from some of these error cases, but this fix prevents the error from happening in the first place (at least on Python 3). In any case, the `UnicodeBOMIE` approach is flawed, because it is clearly illogical for a BOM to appear in the (non-batch) URL(s) specified directly on the command line (and, for that matter, in URLs *after the first line* of a batch list).

3. Having fixed `read_batch_urls` so that it more consistently enumerates only properly trimmed URLs, it can also do a quick on-the-fly elimination of exact duplicates (of course doing so without disturbing the order in which they are listed).
4 years ago
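The three changes described in commit c9a9ccf8a3 can be illustrated with a minimal sketch. This is not the code from the commit: the function name `read_batch_urls_sketch` and the constants are hypothetical, and the sketch only assumes what the message states (comment characters `#`, `;`, `]`; a `\ufeff` prefix on the first decoded line; order-preserving removal of exact duplicates).

```python
import io

# Assumed from the commit message: lines starting with these are comments.
COMMENT_PREFIXES = ('#', ';', ']')
# A UTF-8 BOM shows up as this single character once the file is decoded.
BOM = '\ufeff'


def read_batch_urls_sketch(batch_fd):
    """Hypothetical helper: return trimmed, de-duplicated URLs in file order."""
    seen = set()
    urls = []
    for index, line in enumerate(batch_fd):
        if isinstance(line, bytes):
            # A binary file object yields bytes; decode so the BOM becomes '\ufeff'.
            line = line.decode('utf-8', 'replace')
        if index == 0 and line.startswith(BOM):
            # Drop the BOM before trimming; str.strip() does not remove it.
            line = line[len(BOM):]
        url = line.strip()
        if not url or url.startswith(COMMENT_PREFIXES):
            continue  # skip blank lines, whitespace-only lines, and comments
        if url in seen:
            continue  # skip exact duplicates, keeping the first occurrence
        seen.add(url)
        urls.append(url)
    return urls


# Usage: a BOM on an otherwise blank first line, comments, and a duplicate.
sample = io.StringIO(
    '\ufeff\n'
    '  # comment\n'
    '  https://example.com/a  \n'
    '\n'
    '; note\n'
    'https://example.com/b\n'
    'https://example.com/a\n'
)
print(read_batch_urls_sketch(sample))
```

Removing the BOM before calling `strip()` matters here because `\ufeff` is not treated as whitespace, so leaving it in place would block trimming of whatever follows it.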
Sergey M․ 1d9bf655e6
[utils] Recognize wav mimetype (closes #26463) 4 years ago
Rob 9cd5f54e31
[utils] Fix file permissions in write_json_file (closes #12471) (#25122) 4 years ago
Sergey M․ c380cc28c4
[utils] Improve cookie files support
+ Add support for UTF-8 in cookie files
* Skip malformed cookie file entries instead of crashing (invalid entry len, invalid expires at)
4 years ago
Sergey M․ f1a8511f7b
[utils] Add reference to cookie file format 4 years ago
Sergey M․ 042b664933
Revert "[utils] Add support for cookies with spaces used instead of tabs"
According to [1] TABs must be used as separators between fields.
Files produced by some tools with spaces as separators are considered
malformed.

1. https://curl.haxx.se/docs/http-cookies.html

This reverts commit cff99c91d1.
4 years ago
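To illustrate the rule behind this revert, here is a minimal sketch of a tab-only cookie line parser. It is an assumption-laden illustration, not the project's implementation: `parse_cookie_line` and `FIELD_NAMES` are hypothetical names, the seven-field layout follows the cookie file format referenced in [1], and the handling of malformed entries and of the `#HttpOnly_` prefix mirrors what the neighbouring commit messages describe.

```python
# Hypothetical field names for the seven tab-separated columns of a cookie file line.
FIELD_NAMES = ('domain', 'include_subdomains', 'path', 'secure',
               'expires', 'name', 'value')
HTTPONLY_PREFIX = '#HttpOnly_'


def parse_cookie_line(line):
    """Return a dict for a well-formed entry, or None for lines to skip."""
    line = line.rstrip('\r\n')
    if line.startswith(HTTPONLY_PREFIX):
        # Such lines are real entries, not comments; drop the prefix.
        line = line[len(HTTPONLY_PREFIX):]
    if not line.strip() or line.startswith('#'):
        return None  # blank line or comment
    fields = line.split('\t')  # TAB is the only accepted separator
    if len(fields) != len(FIELD_NAMES):
        return None  # malformed: wrong number of fields (e.g. space-separated)
    entry = dict(zip(FIELD_NAMES, fields))
    if entry['expires'] and not entry['expires'].isdigit():
        return None  # malformed: expiry is not a plain integer timestamp
    return entry


# Usage: a tab-separated line parses; the same line with spaces is rejected.
print(parse_cookie_line('.example.com\tTRUE\t/\tFALSE\t0\tsid\tabc'))
print(parse_cookie_line('.example.com TRUE / FALSE 0 sid abc'))
```

Returning `None` instead of raising keeps one bad line from aborting the whole file, which matches the "skip malformed entries instead of crashing" behaviour described in commit c380cc28c4.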
Sergey M․ cff99c91d1
[utils] Add support for cookies with spaces used instead of tabs 4 years ago
Sergey M․ fca6dba8b8
[YoutubeDL] Force redirect URL to unicode on python 2 4 years ago
Sergey M․ 42db58ec73
[utils] Improve str_to_int 5 years ago
Remita Amine 348c6bf1c1 [utils] handle int values passed to str_to_int 5 years ago
Sergey M․ 1ced222120
[utils] Add generic caesar cipher and rot47 5 years ago
InfernalUnderling 9d30c2132a [utils] Handle rd-suffixed day parts in unified_strdate (#23199) 5 years ago
Sergey M․ 53896ca5be
[utils] Actualize major IPv4 address blocks per country 5 years ago
Sergey M․ 824fa51165
[utils] Improve subtitles_filename (closes #22753) 5 years ago
Sergey M․ f7a147e3b6
[utils] Introduce random_user_agent and use as default User-Agent (closes #21546) 5 years ago
Sergey M․ 28cc2241e4
[utils] Restrict parse_codecs and add theora as known vcodec (#21381) 5 years ago
Sergey M․ 53cd37bac5
[utils] Improve strip_or_none 5 years ago
Sergey M․ 3089bc748c
Fix W504 and disable W503 (closes #20863) 5 years ago
Jakub Wilk fd35d8cdfd [utils] Transliterate "þ" as "th" (#20897)
Despite visual similarity "þ" is unrelated to "p".
It is normally transliterated as "th":

    $ echo þ-Þ | iconv -t ASCII//TRANSLIT
    th-TH
5 years ago
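The same mapping as in the `iconv` example can be sketched in Python. The table and function names below are hypothetical and cover only the two characters this commit discusses; the project's actual transliteration table is not reproduced here.

```python
# Hypothetical, deliberately tiny table: only the characters from this commit.
TRANSLITERATIONS = {
    'þ': 'th',
    'Þ': 'TH',
}


def transliterate(text):
    """Replace each mapped character; leave everything else untouched."""
    return ''.join(TRANSLITERATIONS.get(char, char) for char in text)


print(transliterate('þ-Þ'))  # th-TH
```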
Sergey M․ 5e1271c56d
[utils] Improve int_or_none and float_or_none (#20403) 5 years ago
Sergey M․ 0dc41787af
[utils] Introduce parse_bitrate 5 years ago
Sergey M․ 067aa17edf
Start moving to ytdl-org 5 years ago
remitamine e7e62441cd [utils] strip #HttpOnly_ prefix from cookies files (#20219) 5 years ago
Ales Jirasek 22f5f5c6fc
[malltv] Add extractor (closes #18058) 5 years ago
Sergey M․ fad4ceb534
[utils] Fix urljoin for paths with non-http(s) schemes 5 years ago
Sergey M․ e9a50fba86
[utils] Fix typo 6 years ago
Sergey M․ b7acc83550
[utils] Add language codes replaced in 1989 revision of ISO 639 to ISO639Utils (closes #18765) 6 years ago
Sergey M․ 1bab343704
[YoutubeDL] Introduce YoutubeDLCookieJar and clarify the rationale behind session cookies (closes #12929) 6 years ago
Alexander Seiler aa374bc78e [utils] Fix random_birthday to generate existing dates only 6 years ago
Sergey M․ 25d110be30
[utils] Properly recognize AV1 codec (closes #17506) 6 years ago
Sergey M․ 9e21e6d96b
[utils] Improve remote address skipping and add support for python 2.6 (closes #17362) 6 years ago
Andrew Udvare 8959018a5f
[utils] Skip remote IP addresses non matching to source address' IP version (closes #13422) 6 years ago
Sergey M․ 60c0856223
[utils] Use pure browser header for User-Agent (closes #17236) 6 years ago
Huyuumi 38e87f6c2a [utils] Remove return from __init__ 6 years ago
Sergey M․ af03000ad5
[utils] Introduce url_or_none 6 years ago
Sergey M․ e9c671d5e8
[utils] Allow JSONP with empty func name (closes #17028) 6 years ago
Sergey M․ 0685d9727b
[utils] Share JSON-LD regex 6 years ago
Enes 85750f8972 [openload] Improve ext extraction 6 years ago
Remita Amine 261f47306c [utils] fix style id extraction for namespaced id attribute (closes #16551) 6 years ago
Remita Amine 5a16c9d9d3 [utils] keep the original TV_PARENTAL_GUIDELINES dict 6 years ago
Remita Amine b836118724 [utils] Relax TV Parental Guidelines matching 6 years ago
Sergey M․ 5f95927a62
Improve geo bypass mechanism
* Introduce geo bypass context
* Add ability to bypass based on IP blocks in CIDR notation
* Introduce --geo-bypass-ip-block
6 years ago
Sergey M․ 6cc622327f
[utils] Introduce merge_dicts 6 years ago
Sergey M․ 1cc47c6674
[utils] Fix match_str for boolean meta fields 6 years ago
Philipp Hagemeister f226880c6d [tennistv] Add support for tennistv.com 6 years ago
Sergey M․ b871d7e954
[utils] Add parse_resolution 6 years ago
Sergey M․ befa4708fd
[utils] Fixup some common URL's typos in sanitize_url (closes #15649) 6 years ago
Remita Amine b12cf31bb1 [cbc] add new extractor for olympics.cbc.ca (closes #15535) 6 years ago
Sergey M․ 65220c3bd6
Add support for IronPython 6 years ago
Mike Fährmann c384d537f8 [util] Improve scientific notation handling in js_to_json (closes #14789) 6 years ago