Perl one-liner for adding newlines to HTML

eli — Thu, 12 Mar 2026 11:35:32 +0000

When the rich editor puts all the HTML on one line and I want to edit it, I could always use the “tidy” utility. However, it does too much. All I want is a newline here and there to make the whole thing readable.

So this simple one-liner does the job:

perl -pe 's/(<\/(?:p|h\d|div|tr|td|table|ul|ol|li)>)/"$1\n"/ge'

Not perfect, but gives something to work with.
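For use inside a larger script rather than on the command line, the same substitution can be wrapped in a function. This is only a sketch: the sub name add_newlines is made up for the example, and the /i flag (so that upper-case tags are caught as well) is an addition that the one-liner above doesn't have.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: append a newline after each closing block-level tag.
# The tag list is the one from the one-liner; the /i flag is an addition.
sub add_newlines {
  my ($html) = @_;
  $html =~ s/(<\/(?:p|h\d|div|tr|td|table|ul|ol|li)>)/$1\n/gi;
  return $html;
}

print add_newlines("<div><p>One</p><p>Two</p></div>");
```

Since the replacement is a plain string here, the /e flag and the quotes around "$1\n" aren't needed.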

Converting vtt to srt subtitles with a simple Perl script

eli — Fri, 19 Sep 2025 16:45:46 +0000

I tried to use ffmpeg to convert a vtt file to srt, but that didn’t work at all:

$ ffmpeg -i in.vtt out.srt
Output file is empty, nothing was encoded (check -ss / -t / -frames parameters if used)

I tried a whole lot of suggestions from the Internet, and eventually I gave up.

So I wrote a simple Perl script to get the job done. It took about 20 minutes, because I made a whole lot of silly mistakes:

#!/usr/bin/perl

use warnings;
use strict;

my $n = 1;
my $l;

my $timestamp_regex = qr/[0-9]+:[0-9]+:[0-9:\.]+/; # Very permissive

while (defined ($l = <>)) {
  my ($header) = ($l =~ /^($timestamp_regex --> $timestamp_regex)/);
  next unless (defined $header);

  $header =~ s/\./,/g;

  print "$n\n";
  print "$header\n";

  $n++;

  while  (defined ($l = <>)) {
    last unless ($l =~ /[^ \t\n\r]/); # Nothing but possibly whitespaces

    print $l;
  }
  print "\n";
}

$n--;
print STDERR "Converted $n subtitles\n";

Maybe not a piece of art, and it can surely be made more accurate, but it does the job with simply

$ ./vtt2srt.pl in.vtt > out.srt
Converted 572 subtitles
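One place where more accuracy might matter: WebVTT allows the hours field to be left out (as in 01:02.500), and allows cue settings after the second timestamp. As far as the regex above goes, cues with the short form would be skipped. The following is a minimal sketch of a stricter conversion, not a replacement for the script; the sub names vtt_to_srt_time and vtt_to_srt_header are invented for the example.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: turn one WebVTT timestamp into SRT form.
# Accepts "MM:SS.mmm" or "HH:MM:SS.mmm"; returns undef on anything else.
sub vtt_to_srt_time {
  my ($t) = @_;
  my ($h, $m, $s, $ms) = ($t =~ /^(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\z/)
    or return;
  return sprintf("%02d:%02d:%02d,%03d", $h // 0, $m, $s, $ms);
}

# Hypothetical helper: convert a cue timing line, dropping any cue
# settings that follow the second timestamp. Returns undef if the line
# isn't a timing line.
sub vtt_to_srt_header {
  my ($line) = @_;
  my ($from, $to) = ($line =~ /^(\S+)\s+-->\s+(\S+)/) or return;
  my @t = map { scalar vtt_to_srt_time($_) } ($from, $to);
  return if grep { !defined } @t;
  return "$t[0] --> $t[1]";
}

print vtt_to_srt_header("01:02.500 --> 01:04.000 align:start"), "\n";
```

The output line always has the full HH:MM:SS,mmm form, which is what SRT players generally expect.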

And here’s why Perl is a pearl.

Perl script for mangling SRT subtitle files

eli — Tue, 21 May 2024 09:06:03 +0000

I had a set of SRT files with pretty good subtitles, but with one annoying problem: When there was a song in the background, the translation of the song would pop up and interrupt the dialogue’s subtitles, so it became impossible to understand what’s going on.

Luckily, those song-translating subtitles all had a “{\a6}” string, which is an ASS tag meaning that the text should be shown at the top of the picture. mplayer ignores these tags, which explains why these subtitles make sense, but mess up things for me. So the simple solution is to remove these entries.

Why don’t I use VLC instead? Mainly because I’m used to mplayer, and I’m under the impression that mplayer gives much better and easier control of low-level issues such as adjusting the subtitles’ timing. But also the ability to run it with a lot of parameters from the command line and jumping back and forth in the displayed video, in particular through a keyboard remote control. But maybe it’s just a matter of habit.

Here’s a Perl script that reads an SRT file and removes all entries with such a string. It fixes the numbering of the entries to make up for those that have been removed. Fun fact: The entries don’t need to appear in chronological order. In fact, most of the annoying subtitles appeared at the end of the file, even though they messed up things everywhere.

This can be a boilerplate for other needs as well, of course.

#!/usr/bin/perl
use warnings;
use strict;

my $fname = shift;

my $data = readfile($fname);

my ($name, $ext) = ($fname =~ /^(.*)\.(.*)$/);

die("No extension in file name \"$fname\"\n")
  unless (defined $name);

# Regex for a newline, swallowing surrounding CR if such exist
my $nl = qr/\r*\n\r*/;

# Regex for a subtitle entry
my $tregex = qr/(?:\d+$nl.*?(?:$nl$nl|$))/s;

my ($pre, $chunk, $post) = ($data =~ /^(.*?)($tregex*)(.*)$/);

die("Input file doesn't look like an SRT file\n")
  unless (defined $chunk);

my $lpre = length($pre);
my $lpost = length($post);

print "Warning: Passing through $lpre bytes at beginning of file untouched\n"
 if ($lpre);

print "Warning: Passing through $lpost bytes at end of file untouched\n"
 if ($lpost);

my @items = ($chunk =~ /($tregex)/g);

#### This is the mangling part

my @outitems;
my $removed = 0;
my $counter = 1;

foreach my $i (@items) {
  if ($i =~ /\\a6/) {
    $removed++;
  } else {
    $i =~ s/\d+/$counter/;
    $counter++;
    push @outitems, $i;
  }
}

print "Removed $removed subtitle entries from $fname\n";

#### Mangling part ends here

writefile("$name-clean.$ext", join("", $pre, @outitems, $post));

exit(0); # Just to have this explicit

############ Simple file I/O subroutines ############

sub writefile {
  my ($fname, $data) = @_;

  open(my $out, ">:utf8", $fname)
    or die "Can't open \"$fname\" for write: $!\n";
  print $out $data;
  close $out;
}

sub readfile {
  my ($fname) = @_;

  local $/; # Slurp mode

  open(my $in, "<:utf8", $fname)
    or die "Can't open $fname for read: $!\n";

  my $input = <$in>;
  close $in;

  return $input;
}
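To try out the mangling part without touching any files, the entry regex and the renumbering loop can be run on a short string. This is a sketch under assumptions: filter_entries is a made-up name, the three-entry sample is invented, and the decision to keep an entry is passed in as a code reference so the same loop can serve other needs.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Same regexes as in the script above
my $nl = qr/\r*\n\r*/;
my $tregex = qr/(?:\d+$nl.*?(?:$nl$nl|$))/s;

# Hypothetical helper: keep only the entries for which $keep->($entry)
# is true, and renumber the survivors from 1.
sub filter_entries {
  my ($data, $keep) = @_;
  my @items = ($data =~ /($tregex)/g);
  my $counter = 1;
  my @out;

  foreach my $i (@items) {
    next unless $keep->($i);
    $i =~ s/\d+/$counter/; # First number in the entry is its index
    $counter++;
    push @out, $i;
  }
  return join("", @out);
}

# Invented sample: the second entry carries the {\a6} tag
my $srt = "1\n00:00:01,000 --> 00:00:02,000\nHello\n\n"
        . "2\n00:00:03,000 --> 00:00:04,000\n{\\a6}La la la\n\n"
        . "3\n00:00:05,000 --> 00:00:06,000\nGoodbye\n\n";

print filter_entries($srt, sub { $_[0] !~ /\\a6/ });
```

The third entry comes out renumbered as entry 2, and the tagged one is gone.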

Perl: “$” doesn’t really mean end of string

eli — Mon, 20 Mar 2023 02:59:26 +0000

Who ate my newline?

It’s 2023, Perl is ranked below COBOL, but I still consider it my loyal workhorse. But even the most loyal horse will give you a grand kick in the bottom every now and then.

So let’s jump to the problematic code:

#!/usr/bin/perl
use warnings;
use strict;

my $str = ".\n\n";

my $nonn = qr/[ \t]|(?<!\n)\n(?!\n)/;

my ($pre, $match, $post) = ($str =~ /^($nonn*)(.*?)($nonn*)$/s);

print "pre = \"$pre\"\n";
print "match = \"$match\"\n";
print "post = \"$post\"\n";

print "This doesn't add up!\n"
  unless ($str eq "$pre$match$post");

For now, never mind what I tried to do here. Let’s just note that $nonn doesn’t capture anything: Those two expressions with parentheses are a lookbehind and a lookahead, and hence don’t capture.

So now let’s look at

my ($pre, $match, $post) = ($str =~ /^($nonn*)(.*?)($nonn*)$/s);

This is an enclosure between ^ and $, and everything in the middle is captured into three matches. So no matter what, the concatenation of these three matches should equal $str, shouldn’t it? Let’s give it a test run:

$ ./try.pl
pre = ""
match = ".
"
post = ""
This doesn't add up!

So $pre and $post are empty. OK, fine. Hence $match should equal $str, which is “.\n\n”. But I see only one newline. Where’s the other one?

RTFM

The one thing that I really like about Perl, is that even when it plays a dirty trick, the answer is in the plain manual. As in “man perlre”, where it says, black on white in the description of $:

Match the end of the string (or before newline at the end of the string; or before any newline if /m is used)

So there we have it. “$” can also consider the character before the last newline as the end. Note that “$” itself will not match the last newline, so even if there’s a capture on the “$” itself, as in “($)”, that last newline is still not captured. It’s a Perl quirk. One of those things that make Perl do exactly what you really want, except for when you’re surgical about it.

I’ve been using Perl a lot for 20 years, and I wasn’t aware that “$” could match anything but the end of the string (let alone the “/m” modifier).

So that’s what happened above: $ considered the character before the last newline to be the end, and one newline went up in smoke.

Use \z instead

The second thing that I really like about Perl, is that even when it’s quirky, there’s always a simple solution. The same “man perlre” also says:

\z Match only at end of string

Simple, isn’t it? From now on and until the end of time, always use \z if you really mean the end of string. Like, character-wise. And if I change “$” to “\z” in the code above, I get:

my ($pre, $match, $post) = ($str =~ /^($nonn*)(.*?)($nonn*)\z/s);

and the test run gives:

$ ./try.pl
pre = ""
match = ".

"
post = ""

The workhorse is back on track again.
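The difference between the anchors can also be seen in isolation. The sketch below tries “$”, “\Z” (which perlre documents as matching at the end of the string or before a newline at the end) and “\z” against three strings. The sub name anchors_matching is made up for the example.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Report which end-of-string anchors accept "abc" in the given string
sub anchors_matching {
  my ($str) = @_;
  my @hits;
  push @hits, '$'  if $str =~ /abc$/;
  push @hits, '\Z' if $str =~ /abc\Z/;
  push @hits, '\z' if $str =~ /abc\z/;
  return join(" ", @hits);
}

foreach my $s ("abc", "abc\n", "abc\n\n") {
  (my $shown = $s) =~ s/\n/\\n/g; # Make the newlines visible
  printf "%-10s matched by: %s\n", "\"$shown\"", anchors_matching($s);
}
```

With a single trailing newline, only \z refuses to match. With two trailing newlines none of the three match, because there’s a newline between “abc” and the last newline.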

What I really wanted to do

Since I messed up with this regex, I should maybe explain what it does:

my $nonn = qr/[ \t]|(?<!\n)\n(?!\n)/;

First, let’s note that $nonn only matches one character (or none): It’s either a plain space, a tab or a newline. But what’s the mess with the newline?

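The “(?<!\n)” part is a negative lookbehind: The newline must not come right after another newline. The “(?!\n)” part is a negative lookahead: The newline must not be followed by another one. So a newline matches only when it stands alone.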

No double \n. Or for short, “nonn”.

I needed this for a script that handles multiple newlines later on (in LaTeX, a double newline means a new paragraph, that’s the reason).

And it actually worked. The “\n\n” part in the string was matched into neither $pre nor $post. But the (.*?), which attempts to match as little as possible, sold off the last newline to $. Tricky stuff.
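A small check of the behaviour described in this post, with \z in place of $. It’s a sketch under assumptions: $nonn is written out as it’s described above (a space, a tab, or a newline with a lookbehind and a lookahead), and split_edges is a name invented for the example.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# A space, a tab, or a newline with no other newline next to it
my $nonn = qr/[ \t]|(?<!\n)\n(?!\n)/;

# Hypothetical helper: split a string into leading whitespace, body and
# trailing whitespace, where "whitespace" never eats into a double newline.
sub split_edges {
  my ($str) = @_;
  return ($str =~ /^($nonn*)(.*?)($nonn*)\z/s);
}

foreach my $s (" \tfoo\n", ".\n\n") {
  my ($pre, $match, $post) = split_edges($s);
  print "Adds up\n" if ($s eq "$pre$match$post");
}
```

Both strings add up: In the first, the single trailing newline lands in $post. In the second, the double newline stays in $match.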

Using git send-email with Gmail + OAUTH2, but without subscribing to cloud services

eli — Sun, 30 Oct 2022 09:08:44 +0000

Introduction

There is a widespread belief that in order to use git send-email with Gmail, there’s a need to subscribe to Google Cloud services and obtain some credentials. Or that two-factor authentication (2FA) is required.

This is not the case, however. If Thunderbird can manage to fetch and send emails through Google’s mail servers (as well as other OAUTH2 authenticated mail services), there’s no reason why a utility won’t be able to do the same.

The subscription to Google’s services is indeed required if the communication with Google’s server must be done without human supervision. That’s the whole point with API keys. If a human is around when the mail is dispatched, there’s no need for any special measures. And it’s quite obvious that there’s a responsive human around when a patch is being submitted.

What is actually needed, is a client ID and a client secret, and these are indeed obtained by registering to Google’s cloud service (this explains how). But here’s the thing: Someone at Mozilla has already obtained these, and hardcoded them into Thunderbird itself. So there’s no problem using these to access Gmail with another mail client. It seems like many believe that the client ID and secret must be related to the mail account to access, and therefore each and every one has to obtain their own pair. That’s a mistake that has made a lot of people angry for nothing.

This post describes how to use git send-email without any further involvement with Google, except for having a Gmail account. The same method surely applies for other mail service providers that rely on OAUTH2, but I haven’t gotten into that. It should be quite easy to apply the same idea to other services as well however.

For this to work, Thunderbird must be configured to access the same email account. This doesn’t mean that you actually have to use Thunderbird for your mail exchange. It’s actually enough to configure the Gmail server as an outgoing mail server for the relevant account. In other words, you don’t even need to fetch mails from the server with Thunderbird.

The point is to make Thunderbird set up the OAUTH2 session, and then fetch the relevant piece of credentials from it. And take it from there with Google’s servers. Thunderbird is a good candidate for taking care of the session’s setup, because the whole idea with OAUTH2 is that the user / password session (plus possible additional authentication challenges) is done with a browser. Since Thunderbird is Firefox in disguise, it integrates the browser session well into its general flow.

If you want to use another piece of software to maintain the OAUTH2 session, that’s most likely possible, given that you can get its refresh token. This will also require obtaining its client ID and client secret. Odds are that it can be found somewhere in that software’s sources, exactly as I found it for Thunderbird. Or look at the https connection it runs to get an access token (which isn’t all that easy, encryption and that).

Outline of solution

All below relates to Linux Mint 19, Thunderbird 91.10.0, git version 2.17.1, Perl 5.26 and msmtp 1.8.14. But except for Thunderbird and msmtp, I don’t think the versions are going to matter.

It’s highly recommended to read through my blog post on OAUTH2, in particular the section called “The authentication handshake in a nutshell”. You’re going to need to know the difference between an access token and a refresh token sooner or later.

So the first obstacle is the fact that git send-email relies on the system’s sendmail to send out the emails. That utility doesn’t support OAUTH2 at the time of writing this. So instead, I used msmtp, which is a drop-in replacement for sendmail, plus it supports OAUTH2 (since version 1.8.13).

msmtp identifies itself to the server by sending it an access token in the SMTP session (see a dump of a sample session below). This access token is short-lived (3600 seconds from Google as of writing this), so it can’t be fetched from Thunderbird just like that. In particular because most of the time Thunderbird doesn’t have it.

What Thunderbird does have is a refresh token. It’s a completely automatic task to ask Google’s server for the access token with the refresh token at hand. It’s also an easy task (once you’ve figured out how to do it, that is). It’s also easy to get the refresh token from Thunderbird, exactly in the same way as getting a saved password. In fact, Thunderbird treats the refresh token as a password.

msmtp allows executing an arbitrary program in order to get the password or the access token. So I wrote a Perl script (oauth2-helper.pl) that reads the refresh token from a file and gets an access token from Google’s server. This is how msmtp manages to authenticate itself.
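The exchange itself is a standard OAuth2 refresh request (RFC 6749, section 6). The following is a minimal sketch of it, not the oauth2-helper.pl script: The sub names are invented, the client ID and secret are placeholders, and the URL is Google’s documented token endpoint, which isn’t necessarily the one the script uses. The network part is kept in a sub of its own that isn’t called here.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use JSON::PP; # Core module, exports decode_json

# Assumptions for illustration only: Google's documented token endpoint,
# and placeholders instead of a real client ID and secret.
my $token_url     = 'https://oauth2.googleapis.com/token';
my $client_id     = 'PLACEHOLDER-CLIENT-ID';
my $client_secret = 'PLACEHOLDER-CLIENT-SECRET';

# Form fields of a standard OAuth2 refresh request
sub refresh_request_fields {
  my ($refresh_token) = @_;
  return (
    client_id     => $client_id,
    client_secret => $client_secret,
    refresh_token => $refresh_token,
    grant_type    => 'refresh_token',
  );
}

# Pick the access token and its lifetime out of the server's JSON reply
sub parse_token_reply {
  my ($json) = @_;
  my $reply = decode_json($json);
  die "No access token in reply\n" unless defined $reply->{access_token};
  return ($reply->{access_token}, $reply->{expires_in});
}

# The network part. Needs LWP::UserAgent with https support installed.
sub fetch_access_token {
  my ($refresh_token) = @_;
  require LWP::UserAgent;
  my %form = refresh_request_fields($refresh_token);
  my $res = LWP::UserAgent->new->post($token_url, \%form);
  die "Token request failed: " . $res->status_line . "\n"
    unless $res->is_success;
  return parse_token_reply($res->decoded_content);
}

my ($token, $lifetime) =
  parse_token_reply('{"access_token":"ya29.EXAMPLE","expires_in":3599}');
print "Got a token valid for $lifetime seconds\n";
```

A program given to msmtp’s passwordeval only needs to print the access token to standard output.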

So everything relies on this refresh token. In principle, it can change every time it’s used. In practice, as of today, Google’s servers don’t change it. It seems like the refresh token is automatically replaced every six months, but even if that’s true today, it may change.

But that doesn’t matter so much. All that is necessary is that the refresh token is correct once. If the refresh token goes out of sync with Google’s server, a simple user / password session rectifies this. And as of now, that virtually never happens.

So let’s get to the hands-on part.

Install msmtp

Odds are that your distribution offers msmtp, so it can be installed with something like

# apt install msmtp

Note however that the version needs to be at least 1.8.13, which wasn’t my case (Linux Mint 19). So I installed it from the sources. To do that, first install the TLS library, if it’s not installed already (as root):

# apt install gnutls-dev

Then clone the git repository, compile and install:

$ GIT_SSL_NO_VERIFY=true git clone http://git.marlam.de/git/msmtp.git
$ cd msmtp
$ git checkout msmtp-1.8.14
$ autoreconf -i
$ ./configure
$ make && echo Success
$ sudo make install

The installation goes to /usr/local/bin and other /usr/local/ paths, as one would expect.

I checked out version 1.8.14 because later versions failed to compile on my Linux Mint 19. OAUTH2 support was added in 1.8.13, and judging by the commit messages it hasn’t been changed since, except for commit 1f3f4bfd098, which is “Send XOAUTH2 in two lines, required by Microsoft servers”. Possibly cherry-pick this commit (I didn’t).

Once everything has been set up as described below, it’s possible to send an email with

$ msmtp -v -t < ~/email.eml

The -v flag is used only for debugging, and it prints out the entire SMTP session.

The -t flag tells msmtp to fetch the recipients from the mail’s own headers. Otherwise, the recipients need to be listed in the command line, just like sendmail. Without this flag or recipients, msmtp just replies with

msmtp: no recipients found

The -t flag isn’t necessary with git send-email, because it explicitly lists the recipients in the command line.

The oauth2-helper.pl script

As mentioned above, Thunderbird has the refresh token, but msmtp needs an access token. The script that talks with Google’s server and grabs the access token can be downloaded from its GitHub repo. Save it, with execution permission, as /usr/local/bin/oauth2-helper.pl (or whatever, but this is what I assume in the configurations below).

Some Perl libraries may be required to run this script. On a Debian-based system, the packages’ names are probably something like libhttp-message-perl, libwww-perl and libjson-perl.

It’s written to access Google’s token server, but can be modified easily to access a different service provider by changing the parameters at its beginning. For other email providers, check if it happens to be listed in OAuth2Providers.sys.mjs. I don’t know how well it will work with those other providers, though.

The script reads the refresh token from ~/.oauth2_reftoken as a plain file containing the blob only. There’s an inherent security risk of having this token stored like this, but it’s basically the same risk as the fact that it can be obtained from Thunderbird’s credential files. The difference is the amount of security by obscurity. Anyhow, the refresh token isn’t your password, and the password can’t be derived from it. Either way, make sure that this file has a 0600 or 0400 permission, if you’re running on a multi-user computer.

The script caches the access token in ~/.oauth2_acctoken, with an expiration timestamp. As of today, it means that the script talks with Google’s server once in 60 minutes at most.
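The caching idea can be sketched as follows. This is not taken from the script: The file layout (expiry time on the first line, token on the second), the sub names and the safety margin are all assumptions made for the example.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Assumed cache layout (hypothetical): the first line is the expiry time
# in seconds since the epoch, the second line is the access token.
sub write_cache {
  my ($file, $token, $lifetime, $now) = @_;
  my $old = umask(0077); # New file accessible by its owner only
  open(my $out, ">", $file)
    or die "Can't open \"$file\" for write: $!\n";
  umask($old);
  printf $out "%d\n%s\n", $now + $lifetime, $token;
  close $out;
}

# Return the cached token, or undef if there's no usable cache file or
# if the token expires within $margin seconds.
sub read_cache {
  my ($file, $now, $margin) = @_;
  open(my $in, "<", $file) or return;
  my ($expiry, $token) = <$in>;
  close $in;
  return unless (defined $expiry && defined $token);
  chomp($expiry, $token);
  return unless ($expiry =~ /^\d+\z/);
  return if ($expiry - $margin <= $now);
  return $token;
}

my $cache = "/tmp/oauth2_acctoken_demo.$$";
write_cache($cache, "ya29.EXAMPLE", 3600, time());
my $t = read_cache($cache, time(), 60);
print defined $t ? "Cache hit\n" : "Cache miss\n";
unlink $cache;
```

The margin keeps a token that is about to expire from being handed over to msmtp just before the server stops accepting it.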

Setting up config files

So with msmtp installed and the script downloaded into /usr/local/bin/oauth2-helper.pl, all that is left is configuration files.

First, create ~/.msmtprc as follows (put your Gmail username instead of mail.username, of course):

account default
host smtp.gmail.com
port 587
tls on
tls_starttls on
auth xoauth2
user mail.username
passwordeval /usr/local/bin/oauth2-helper.pl
from mail.username@gmail.com

And then change the [sendemail] section in ~/.gitconfig to

[sendemail]
        smtpServer = /usr/local/bin/msmtp

That’s it. Only that single line. It’s however possible to use smtpServerOption in the .gitconfig to add various flags. So for example, to get the entire SMTP session shown while sending the email, it should say:

[sendemail]
        smtpServer = /usr/local/bin/msmtp
        smtpServerOption = -v

But really, don’t, unless there’s a problem sending mails.

Other than that, don’t keep old settings. For example, there should not be a “from=” entry in .gitconfig. Having one causes a “From:” header to be added to the mail body (so it’s visible to the reader of the mail). This header is created when there is a difference between the “From” that is generated by git send-email (which is taken from the “from=” entry) and the patch’s author, as it appears in the patch’s “From” header. The purpose of this in-body header is to tell “git am” who the real author is (i.e. not the sender of the patch). So this extra header won’t appear in the commit, but it nevertheless makes the sender of the message look somewhat clueless.

So in short, no old junk.

Sending a patch

Unless it’s the first time, I suggest just trying to send the patch to your own email address, and see if it works. There’s a good chance that the refresh token from the previous time will still be good, so it will just work, and no point hassling more.

Actually, it’s fine to try like this even on the first time, because the Perl script will fail to grab the access token and then tell you what to do to fix it, namely:

Make sure that Thunderbird has access to the mail account itself, possibly by attempting to send an email through Gmail’s server.
Go to Thunderbird’s Preferences > Privacy & Security and click on Saved Passwords. Look for the account, where the Provider starts with oauth://. Right-click that line and choose “Copy Password”.
Create or open ~/.oauth2_reftoken, and paste the blob into that file, so it contains only that string. No need to be uptight with newlines and whitespaces: They are ignored.

And then go, as usual:

$ git send-email --to 'my@test.mail' 0001-my.patch

I’ve added the output of a successful session (with the -v flag) below.

Room for improvements

It would have been nicer to fetch the refresh token automatically from Thunderbird’s credentials store (that is from logins.json, based upon the decryption key that is kept in key4.db), but the available scripts for that are written in Python. And to me Python is equal to “will cause trouble sooner or later”. Anyhow, this tutorial describes the mechanism (in the part about Firefox).

Besides, it could have been even nicer if the script was completely standalone, and didn’t depend on Thunderbird at all. That requires doing the whole dance with the browser, something I have no motivation to get into.

A successful session

This is what it looks like when a patch is properly sent, with the smtpServerOption = -v line in .gitconfig (so msmtp produces verbose output):

Send this email? ([y]es|[n]o|[q]uit|[a]ll): y
ignoring system configuration file /usr/local/etc/msmtprc: No such file or directory
loaded user configuration file /home/eli/.msmtprc
falling back to default account
Fetching access token based upon refresh token in /home/eli/.oauth2_reftoken...
using account default from /home/eli/.msmtprc
host = smtp.gmail.com
port = 587
source ip = (not set)
proxy host = (not set)
proxy port = 0
socket = (not set)
timeout = off
protocol = smtp
domain = localhost
auth = XOAUTH2
user = mail.username
password = *
passwordeval = /usr/local/bin/oauth2-helper.pl
ntlmdomain = (not set)
tls = on
tls_starttls = on
tls_trust_file = system
tls_crl_file = (not set)
tls_fingerprint = (not set)
tls_key_file = (not set)
tls_cert_file = (not set)
tls_certcheck = on
tls_min_dh_prime_bits = (not set)
tls_priorities = (not set)
tls_host_override = (not set)
auto_from = off
maildomain = (not set)
from = mail.username@gmail.com
set_from_header = auto
set_date_header = auto
remove_bcc_headers = on
undisclosed_recipients = off
dsn_notify = (not set)
dsn_return = (not set)
logfile = (not set)
logfile_time_format = (not set)
syslog = (not set)
aliases = (not set)
reading recipients from the command line
<-- 220 smtp.gmail.com ESMTP m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
--> EHLO localhost
<-- 250-smtp.gmail.com at your service, [109.186.183.118]
<-- 250-SIZE 35882577
<-- 250-8BITMIME
<-- 250-STARTTLS
<-- 250-ENHANCEDSTATUSCODES
<-- 250-PIPELINING
<-- 250-CHUNKING
<-- 250 SMTPUTF8
--> STARTTLS
<-- 220 2.0.0 Ready to start TLS
TLS session parameters:
    (TLS1.2)-(ECDHE-ECDSA-SECP256R1)-(CHACHA20-POLY1305)
TLS certificate information:
    Subject:
        CN=smtp.gmail.com
    Issuer:
        C=US,O=Google Trust Services LLC,CN=GTS CA 1C3
    Validity:
        Activation time: Mon 26 Sep 2022 11:22:04 AM IDT
        Expiration time: Mon 19 Dec 2022 10:22:03 AM IST
    Fingerprints:
        SHA256: 53:F3:CA:1D:37:F2:1F:ED:2C:67:40:A2:A2:29:C2:C8:E8:AF:9E:60:7A:01:92:EC:F0:2A:11:E8:37:A5:88:F3
        SHA1 (deprecated): D4:69:6E:59:2D:75:43:59:02:74:25:67:E7:57:40:E0:28:43:A8:62
--> EHLO localhost
<-- 250-smtp.gmail.com at your service, [109.186.183.118]
<-- 250-SIZE 35882577
<-- 250-8BITMIME
<-- 250-AUTH LOGIN PLAIN XOAUTH2 PLAIN-CLIENTTOKEN OAUTHBEARER XOAUTH
<-- 250-ENHANCEDSTATUSCODES
<-- 250-PIPELINING
<-- 250-CHUNKING
<-- 250 SMTPUTF8
--> AUTH XOAUTH2 dXNlcj1lbGkuYmlsbGF1ZXIBYXV0aD1CZWFyZXIgeWEyOS5hMEFhNHhyWE1GM1gtOTJMVWNidjE4MFdVOBROENRcUdSbk5KaUFSY0VSckVaXzdzbDlHMTNpdFIyUTk0NjlKWG45aHVGLQVRBU0FSTVXJpSjRqMjBLcWh6WU9GekxlcU5BYVpFNUU4WXRhNjdLUXpCRm1HRDg3dFgzeHJ4amNPTnRVTkZFVWdESXhsUlcxOFhVT0pqQ1hPSlFwZlNGUUVqRHZMOWw4RExkTjlKZlNbGRTazNNbFNMNjVfQWFDZ1lLVVF2Y0luOWNSSUEwMTY2AQE=
<-- 235 2.7.0 Accepted
--> MAIL FROM:
--> RCPT TO:
--> RCPT TO:
--> DATA
<-- 250 2.1.0 OK m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
<-- 250 2.1.5 OK m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
<-- 250 2.1.5 OK m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
<-- 354  Go ahead m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
--> From: Eli Billauer 
--> To: test@mail.com
--> Cc: Eli Billauer 
--> Subject: [PATCH v8] Gosh! Why don't you apply this patch already!
--> Date: Sun, 30 Oct 2022 07:01:14 +0200
--> Message-Id: <20221030050114.49299-1-mail.username@gmail.com>
--> X-Mailer: git-send-email 2.17.1
--> 

[ ... email body comes here ... ]

--> --
--> 2.17.1
-->
--> .
<-- 250 2.0.0 OK  1667106108 m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
--> QUIT
<-- 221 2.0.0 closing connection m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
OK. Log says:
Sendmail: /usr/local/bin/msmtp -v -i test@mail.com mail.username@gmail.com
From: Eli Billauer 
To: test@mail.com
Cc: Eli Billauer 
Subject: [PATCH v8] Gosh! Why don't you apply this patch already!
Date: Sun, 30 Oct 2022 07:01:14 +0200
Message-Id: <20221030050114.49299-1-mail.username@gmail.com>
X-Mailer: git-send-email 2.17.1

Result: OK

Ah, and the fact that the access token can be copied from here is of course meaningless, as it has expired long ago.
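For those curious about what that long AUTH XOAUTH2 argument is: According to Google’s description of the XOAUTH2 mechanism, it’s the base64 encoding of the user name and the access token, separated by Ctrl-A characters. A minimal sketch, with an invented sub name and a fake token:

```perl
#!/usr/bin/perl
use warnings;
use strict;
use MIME::Base64 qw(encode_base64 decode_base64);

# Build the argument of "AUTH XOAUTH2":
# base64 of "user=" USER ^A "auth=Bearer " TOKEN ^A ^A
sub xoauth2_arg {
  my ($user, $access_token) = @_;
  # The empty second argument prevents line breaks in the output
  return encode_base64("user=$user\x01auth=Bearer $access_token\x01\x01", "");
}

my $arg = xoauth2_arg('mail.username', 'ya29.EXAMPLE');
(my $shown = decode_base64($arg)) =~ s/\x01/^A/g; # Make Ctrl-A visible

print "AUTH XOAUTH2 $arg\n";
print "Decodes to: $shown\n";
```

Running decode_base64 on the string in the dump above shows the same structure.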

Thunderbird debug notes

These are some random notes I made while digging in Thunderbird’s guts to find out what’s going on.

So this is Thunderbird’s official git repo. Not that I used it.

To get logging info from Thunderbird: Based upon this page, go to Thunderbird’s preferences > General and click the Config Editor button. Set mailnews.oauth.loglevel to All (was Warn). Same with mailnews.smtp.loglevel. Then open the Error Console with Ctrl+Shift+J.

The cute thing about these logs is that the access code is written in the log. So it’s possible to skip the Perl script, and use the access code from Thunderbird’s log. Really inconvenient, but possible.

The OAuth2 token requests are implemented in Oauth2.jsm. It’s possible to set a breakpoint in this module by going through Tools > Developer Tools > Developer Toolbox, and once it opens (after requesting permission for external connection), go to the debugger.

Find Oauth2.jsm in the sources pane to the left (of the Debugger tab), under resource:// modules > sessionstore. Add a breakpoint in requestAccessToken() so that the clientID and consumerSecret properties can be revealed.

Sending a patch from Thunderbird directly

This is a really bad idea. But if you have Thunderbird, and need to send a patch right now, this is a quick, dirty and somewhat dangerous procedure for doing that.

Why is it dangerous? Because at some point, it’s easy to pick “Send now” instead of “Send later”, and boom, a junk patch is mailed to the whole world.

The problem with Thunderbird is that it makes small changes to the patch’s body. So to work around this, there’s a really silly procedure. I used it once, and I’m not proud of that.

So here we go.

First, a very simple script that outputs the patch mail into a file. Say that I called it dumpit (should be executable, of course):

#!/bin/bash

cat > /home/eli/Desktop/git-send-email.eml

Then change ~/.gitconfig, so it reads something like this in the [sendemail] section:

[sendemail]
        from = mail.username@gmail.com
        smtpServer = /home/eli/Desktop/dumpit

So basically it uses the silly script as a mail server, and the content goes out to a plain file.

Then run git send-email as usual. The result is a git-send-email.eml as a file.

And now comes the part of making Thunderbird send it.

Close Thunderbird. All windows.
Change directory to where Thunderbird keeps its profile files, to under Mail/Local Folders
Remove “Unsent Messages” and “Unsent Messages.msf”
Open Thunderbird again
Inside Thunderbird, go to Hamburger Icon > File > Open > Saved Message… and select git-send-email.eml. The email message should appear.
Right-Click somewhere in the message’s body, and pick Edit as New Message…
Don’t send this message as is! It’s completely messed up. In particular, there are some indentations in the patch itself, which renders it useless.
Instead, pick File > Send Later.
Once again, close Thunderbird. All windows.
Remove “Unsent Messages.msf” (only)
Edit “Unsent Messages” as follows: Everything under the “Content-Transfer-Encoding: 7bit” part is the mail’s body. So remove the “From:” line after it, and paste the email’s body from git-send-email.eml instead.
Note that there are normally two blank lines after the mail’s body. Retain them.
Open Thunderbird again. Verify that those indentations are gone.
Look at the mail inside Outbox, and verify that it’s OK now. These are the three things to look for in particular:
- The “From:” part at the beginning of the message is gone.
- At the end of the message, there’s a “--” and git’s version number. These should be on separate lines.
- Look at the mail’s source. The “+” and “-” signs of the diffs must not be indented.
If all is fine, right-click Outbox, and pick “Send unsent messages”. And hope for the best.

Are you sure you want to do this?

Blocking bots by their IP addresses, the DIY version

eli — Tue, 16 Aug 2022 10:26:37 +0000

Introduction

I had some really annoying bots on one of my websites. Of the sort that make a million requests (like really, a million) per month, identifying themselves as a browser.

So IP blocking it is. I went for a minimalistic DIY approach. There are plenty of tools out there, but my experience with things like this is that in the end, it’s me and the scripts. So I might as well write them myself.

The IP set feature

Iptables has an IP set module, which allows feeding it with a set of random IP addresses. Internally, it creates a hash with these addresses, so it’s an efficient way to keep track of multiple addresses.

IP sets have been in the kernel for ages, but the feature has to be enabled in the kernel with CONFIG_IP_SET. Which it most likely is.

The ipset utility may need to be installed, with something like

# apt install ipset

There seems to be a protocol mismatch issue with the kernel, which apparently is a non-issue. But every time something goes wrong with ipset, there’s a warning message about this mismatch, which is misleading. So it looks something like this.

# ipset [ ... something stupid or malformed ... ]
ipset v6.23: Kernel support protocol versions 6-7 while userspace supports protocol versions 6-6
[ ... some error message related to the stupidity ... ]

So the important thing to be aware of is that odds are that the problem isn’t the version mismatch, but between chair and keyboard.

Hello, world

A quick session

# ipset create testset hash:ip
# ipset add testset 1.2.3.4
# iptables -I INPUT -m set --match-set testset src -j DROP
# ipset del testset 1.2.3.4

Attempting to add an IP address that is already in the list causes a warning, and the address isn’t added. So no need to check if the address is already there. Besides, there’s the -exist option, which is really great.

List the members of the IP set:

# ipset -L

Timeout

An entry can have a timeout, which works exactly as one would expect: The entry vanishes after the timeout expires. The timeout value shown by ipset -L counts down.

For this to work, the set must be created with a default timeout attribute. Zero means that timeout is disabled (which I chose as a default in this example).

# ipset create testset hash:ip timeout 0
# ipset add testset 1.2.3.4 timeout 10

The ‘-exist’ flag causes ipset to re-add an existing entry, which also resets its timeout. So this is the way to keep the list fresh.
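As for the script side of DIY, one possible shape for it is counting requests per address in the web server’s access log and printing ipset commands for the heavy hitters. This is a sketch under assumptions, not the script actually used here: The log is assumed to have the client’s IPv4 address as the first field of each line, and the sub names, the threshold, the timeout and the sample log lines are all invented.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: count requests per client address (assumed to be
# the first field, IPv4 only) and return the addresses with at least
# $threshold hits, busiest first.
sub offenders {
  my ($threshold, @lines) = @_;
  my %hits;
  foreach my $l (@lines) {
    my ($ip) = ($l =~ /^(\d{1,3}(?:\.\d{1,3}){3})\s/) or next;
    $hits{$ip}++;
  }
  return sort { $hits{$b} <=> $hits{$a} || $a cmp $b }
         grep { $hits{$_} >= $threshold } keys %hits;
}

# One "ipset add" command per address. With -exist, an address that is
# already in the set gets its timeout refreshed instead of a warning.
sub ipset_commands {
  my ($set, $timeout, @ips) = @_;
  return map { sprintf("ipset -exist add %s %s timeout %d\n",
                       $set, $_, $timeout) } @ips;
}

# Invented log lines for the demonstration
my @log = (
  '1.2.3.4 - - [16/Aug/2022:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
  '1.2.3.4 - - [16/Aug/2022:10:00:01 +0000] "GET /a HTTP/1.1" 200 512',
  '5.6.7.8 - - [16/Aug/2022:10:00:02 +0000] "GET / HTTP/1.1" 200 512',
);

print ipset_commands('testset', 86400, offenders(2, @log));
```

The printed commands can be reviewed first and piped to a shell afterwards, which leaves a human in the loop.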

Don’t put the DROP rule first

It’s tempting to put the DROP rule with --match-set first, because hey, let’s give those intruders the boot right away. But doing that, there might be TCP connections lingering, because the last FIN packet is caught by the firewall as the new rule is added. Given that adding an IP address is the result of a flood of requests, this is a realistic scenario.

The solution is simple: There’s most likely a “state RELATED,ESTABLISHED” rule somewhere in the list. So push it to the top. The rationale is simple: If a connection has begun, don’t chop it in the middle in any case. It’s the first packet that we want killed.

Persistence

The rule in iptables must refer to an existing set. So if the rule that relies on the set is part of the persistent firewall rules, the set must be created before the script that brings up iptables runs.

This is easily done by adding a rule file like this as /usr/share/netfilter-persistent/plugins.d/10-ipset

#!/bin/sh

IPSET=/sbin/ipset
SET=mysiteset

case "$1" in
start|restart|reload|force-reload)
	$IPSET destroy
	$IPSET create $SET hash:ip timeout 0
	;;

save)
	echo "ipset-persistent: The save option does nothing"
	;;

stop|flush)
	$IPSET flush $SET
	;;
*)
	echo "Usage: $0 {start|restart|reload|force-reload|save|stop|flush}" >&2
	exit 1
	;;
esac

exit 0

The idea is that the index 10 in the file’s name is smaller than the index of the plugin that sets up iptables, so it runs first.

This script is a dirty hack, but hey, it works. There’s a small project on this, for those who like to do it properly.

The operating system in question is systemd-based, but this old school style is still in effect.

Maybe block by country?

Since all offending requests came from the same country (cough, cough, China, from more than 4000 different IP addresses) I’m considering blocking them in one go. A list of 4000+ IP addresses that I busted in August 2022 with aggressive bots (all from China) can be downloaded as a simple compressed text file.

So the idea goes something like

ipset create foo hash:net
ipset add foo 192.168.0.0/24
ipset add foo 10.1.0.0/16
ipset add foo 192.168.0/24

and download the per-country IP ranges from IP deny. That’s a simple and crude tool for denial by geolocation. The only thing that puts me off a bit is that it’s > 7000 entries, so I wonder if that doesn’t put a load on the server. But what really counts is the number of distinct netmask sizes, because each netmask size has its own hash lookup. So if the list covers all possible sizes, from a full /32 down to say, /16, there are 17 hashes to look up for each packet arriving.
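To get a feeling for how many hash lookups a given list would cost, the distinct netmask sizes can be counted before loading the list into ipset. This is a minimal sketch, assuming that the list comes as plain text with one range per line. The function name and the sample ranges are made up for illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: count the entries for each prefix length in a
# list of IPv4 ranges. A bare address is counted as a /32.
sub prefix_histogram {
  my %count;

  foreach my $r (@_) {
    # Skip anything that doesn't look like a.b.c.d or a.b.c.d/len
    next unless ($r =~ /^\s*\d+\.\d+\.\d+\.\d+(?:\/(\d+))?\s*$/);

    my $len = defined $1 ? $1 : 32;
    $count{$len}++;
  }

  return %count;
}

# Made-up sample data, including a line that should be ignored
my @sample = ('192.168.0.0/24', '10.1.0.0/16', '172.16.4.0/24',
	      '1.2.3.4', '# a comment');

my %hist = prefix_histogram(@sample);

foreach my $len (sort { $a <=> $b } keys %hist) {
  print "/$len: $hist{$len}\n";
}

print scalar(keys %hist), " distinct netmask sizes\n";
```

With the sample data, there are three distinct sizes (/16, /24 and /32), so that would be three lookups per packet, no matter how many entries each size has.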

On the other hand, since the rule should be after the “state RELATED,ESTABLISHED” rule, it only covers SYN packets. And if this whole thing is put as late as possible in the list of rules, it boils down to handling only packets that are intended for the web server’s ports, or those that are going to be dropped anyhow. So compared with the CPU cycles of handling the http request, even 17 hashes isn’t all that much.

The biggest caveat is however if other websites are colocated on the server. It’s one thing to block offending IPs, but blocking a whole country from all sites, that’s a bit too much.

Note to self: In the end, I wrote a little Perl-XS module that says if the IP belongs to a group. Look for byip.pm.

The blacklisting script

The Perl script that performs the blacklisting is crude and inaccurate, but simple. This is the part to tweak and play with, and in particular adapt to each specific website. It’s all about detecting abnormal access.

Truth be told, I replaced this script with a more sophisticated mechanism pretty much right away on my own system. But what’s really interesting is the calls to ipset.

This script reads through Apache’s access log file, and analyzes each minute in time (as in 60 seconds). In other words, all accesses that have the same timestamp, with the seconds part ignored. Note that the regex part that captures $time in the script ignores the last part of :\d\d.

If the same IP address appears 50 times or more within the same minute, that address is blacklisted, with a timeout of 86400 seconds (24 hours). Log entries that correspond to page requisites and such (images, style files etc.) are skipped for this purpose. Otherwise, it’s easy to reach 50 accesses within a minute with legit web browsing.

There are several imperfections about this script, among others:

Since it reads through the entire log file each time, it keeps relisting each IP address until the log file is rotated away, and a new one is started. This causes an update of the timeout, so effectively the blacklisting takes place for up to 48 hours.
Looking in segments of accesses that happen to have the same minute in the timestamp is quite inaccurate regarding which IPs are caught and which aren’t.

The script goes as follows:

#!/usr/bin/perl
use warnings;
use strict;

my $logfile = '/var/log/mysite.com/access.log';
my $limit = 50; # 50 accesses per minute
my $timeout = 86400;

open(my $in, "<", $logfile)
  or die "Can't open $logfile for read: $!\n";

my $current = '';
my $l;
my %h;
my %blacklist;

while (defined ($l = <$in>)) {
  my ($ip, $time, $req) = ($l =~ /^([^ ]+).*?\[(.+?):\d\d[ ].*?\"\w+[ ]+([^\"]+)/);
  unless (defined $ip) {
    #    warn("Failed to parse line $l\n");
    next;
  }

  next
    if ($req =~ /^\/(?:media\/|robots\.txt)/);

  unless ($time eq $current) {
    foreach my $k (sort keys %h) {
      $blacklist{$k} = 1
	if ($h{$k} >= $limit);
    }

    %h = ();
    $current = $time;
  }
  $h{$ip}++;
}

close $in;

foreach my $k (sort keys %blacklist) {
  system('/sbin/ipset', 'add', '-exist', 'mysiteset', $k, 'timeout', $timeout);
}

It has to be run as root, of course. Most likely as a cronjob.
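To see what the regular expression in the script captures, it can be tried out on a single line. The following sketch wraps the same regex in a helper function (its name is made up) and feeds it with a made-up line in Apache’s combined log format. Real log lines may differ, depending on the LogFormat setting.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Same regex as in the blacklisting script. Returns an empty list if
# the line doesn't match.
sub parse_line {
  my ($l) = @_;

  return ($l =~ /^([^ ]+).*?\[(.+?):\d\d[ ].*?\"\w+[ ]+([^\"]+)/);
}

# Made-up log line
my $line = '203.0.113.7 - - [15/Aug/2022:07:18:50 +0000] ' .
  '"GET /blog/index.html HTTP/1.1" 200 5123 "-" "Mozilla/5.0"';

my ($ip, $time, $req) = parse_line($line);

print "ip: $ip\n";     # 203.0.113.7
print "time: $time\n"; # 15/Aug/2022:07:18 (the seconds are dropped)
print "req: $req\n";   # /blog/index.html HTTP/1.1
```

Note that $time has no seconds part, so all accesses within the same minute get the same string. That’s what the script’s comparison with $current relies on.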

Google Translate, LaTeX and Asian languages: Technical notes

eli — Mon, 15 Aug 2022 07:18:50 +0000

Introduction

This post contains a few technical notes on using Google Translate for translating LaTeX documents into Chinese, Japanese and Korean. The insights on the language-related issues are written down in a separate post.

Text vs. HTML

Google’s cloud translator can be fed with either plain text or HTML, and it returns the same format. Plain text format is out of the question for anything but translating short sentences, as it becomes impossible to maintain the text’s formatting. So I went for the HTML interface.

The thing with HTML is that whitespaces can take different forms and shapes, and they are redundant in many situations. For example, a newline is often equivalent to a plain space, and neither makes any difference between two paragraphs that are enclosed by <p> tags.

Google Translate takes this notion to the extreme, and typically removes all newlines from the original text. OK, that’s understandable. But it also adds and removes whitespaces where it had no business doing anything, in particular around meaningless segments that aren’t translated anyhow. This makes it quite challenging when feeding the results for further automatic processing.

Setting up a Google Cloud account

When creating a new Google Cloud account, there’s an automatic credit of $300 to spend for three months. So there’s plenty of room for much needed experimenting. To see the status of the evaluation period, go to Billing > Cost Breakdown and wait a minute or so for the “Free trial status” strip to appear at the top of the page. There’s no problem with “activating full account” immediately. The free trial credits remain, but it also means that real billing occurs when the credits are consumed and/or the trial period is over.

First create a new Google cloud account and enable the Google Translate API.

I went for Basic v2 translation (and not Advanced, v3). Their pricing is the same, but v3 is not allowed with an API key, and I really wasn’t into setting up a service account and struggling with OAuth2. The main advantage with v3 is the possibility to train the machine to adapt to a specific language pattern, but as mentioned in that separate post, I’m hiding away anything but common English language patterns.

As for authentication, I went for API keys. I don’t need any personalized info, so that’s the simple way to go. To obtain the keys, go to main menu (hamburger icon) > APIs and services > Credentials and pick Create Credentials, and choose to create API keys. Copy the string and use it in the key=API_KEY parameters in POST requests. It’s possible to restrict the usage of this key in various ways (HTTP referrer, IP address etc.) but it wasn’t relevant in my case, because the script runs only on my computer.

The web interface for setting up cloud services is horribly slow, which is slightly ironic and a bit odd for a company like Google.

The translation script

I wrote a simple script for taking a piece of text in English and translating it into the language of choice:

#!/usr/bin/perl

use warnings;
use strict;
use LWP::UserAgent;
use JSON qw[ from_json ];

our $WASTEMONEY = 0; # Prompt before making request
my $MAXLEN = 500000;
my $chars_per_dollar = 50000; # $20 per million characters

our $APIkey = 'your API key here';

my ($outfile, $origfile, $lang) = @ARGV;

die("Usage: $0 outfile origfile langcode\n")
  unless (defined $lang);

my $input = readfile($origfile);

askuser() unless ($WASTEMONEY);

my $len = length $input;

die("Cowardly refusing to translate $len characters\n")
  if ($len > $MAXLEN);

writefile($outfile, translate($input, $lang));

################## SUBROUTINES ##################

sub writefile {
  my ($fname, $data) = @_;

  open(my $out, ">", $fname)
    or die "Can't open \"$fname\" for write: $!\n";
  binmode($out, ":utf8");
  print $out $data;
  close $out;
}

sub readfile {
  my ($fname) = @_;

  local $/; # Slurp mode

  open(my $in, "<", $fname)
    or die "Can't open $fname for read: $!\n";

  my $input = <$in>;
  close $in;

  return $input;
}

sub askuser {
  my $len = length $input;
  my $cost = sprintf('$%.02f', $len / $chars_per_dollar);

  print "\n\n*** Approval to access Google Translate ***\n";
  print "$len bytes to $lang, $cost\n";
  print "Source file: $origfile\n";
  print "Proceed? [y/N] ";

  my $ans = <STDIN>;

  die("Aborted due to lack of consent to proceed\n")
    unless ($ans =~ /^y/i);
}

sub translate {
  my ($text, $lang) = @_;

  my $ua = LWP::UserAgent->new;
  my $url = 'https://translation.googleapis.com/language/translate/v2';

  my $res = $ua->post($url,
		      [
		       source => 'en',
		       target => $lang,
		       format => 'html', # Could be 'text'
		       key => $APIkey,
		       q => $text,
		      ]);

  die("Failed to access server: ". $res->status_line . "\n")
    unless ($res->is_success);

  my $data = $res->content;

  my $json = from_json($data, { utf8 => 1 } );

  my $translated;

  eval {
    my $d = $json->{data};
    die("Missing \"data\" entry\n") unless (defined $d);

    my $tr = $d->{translations};
    die("Missing \"translations\" entry\n")
      unless ((defined $tr) && (ref $tr eq 'ARRAY') &&
	     (ref $tr->[0] eq 'HASH'));

    $translated = $tr->[0]->{translatedText};

    die("No translated text\n")
      unless (defined $translated);
  };

  die("Malformed response from server: $@\n") if ($@);

  $translated =~ s/(<\/(?:p|h\d+)>)[ \t\n\r]*/"$1\n"/ge;

  return $translated;
}

The substitution at the end of the translate() function adds a newline after each closing tag for a paragraph or header (e.g. </p>, </h1> etc.) so that the HTML is more readable with a text editor. Otherwise it’s all in one single line.
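The effect of this substitution is easy to try out separately. The sketch below puts the same substitution in a small function (the name is made up for this example) and runs it on a short, made-up piece of HTML.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical wrapper around the substitution used in translate():
# One newline after each closing </p> or </hN> tag, replacing any
# whitespace that follows it.
sub break_lines {
  my ($html) = @_;

  $html =~ s/(<\/(?:p|h\d+)>)[ \t\n\r]*/"$1\n"/ge;

  return $html;
}

my $in = '<h1>Title</h1>  <p>First paragraph.</p><p>Second one.</p>';

print break_lines($in);
```

This prints the header and each of the two paragraphs in a line of its own. Tags that aren’t in the regex, e.g. closing tags of lists and tables, are left as they are.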

Protecting your money

By obtaining an API key, you effectively give your computer permission to spend money. Which is fine as long as it works as intended, but a plain bug in a script that leads to an infinite loop or recursion, or maybe just feeding the system with a huge file by mistake, can end up with consequences that are well beyond the CPU fan spinning a bit.

So there are two protection mechanisms in the script itself:

The script prompts for permission, stating how much it will cost (based upon $20 / million chars).
It limits a single translation to 500k chars (to avoid a huge file from being processed accidentally).

Another safety mechanism is to set up budgets and budget alerts. Go to Main menu (hamburger) > Billing > Budgets & Alerts. Be sure to check “Email alerts to billing admins and users”. If I got it right, budgets don’t protect against spending, but only send notifications. So I selected a sum, and enabled only the 100% threshold. It seems to make sense to check all the Discounts and Promotion options in the Credits part, which makes sure that the alert relates to the money actually spent, after deducting all promotion credits.

On top of that, it’s a good idea to set quota limits: Go to Main menu (hamburger) > IAM & Admin > Quotas. Set the filter to Translation to get rid of a lot of lines.

It’s also the place to get an accurate figure for the current consumption.

Enable the quota for “v2 and v3 general model characters per day”, which is the only character limit that isn’t per minute, and set it to something sensible, for example 2 million characters if you’re a modest user like myself. That’s $40, which is fairly acceptable damage if the computer goes crazy, and high enough not to hit the roof normally.

Also do something with “v3 batch translation characters using general models per day” and same with AutoML custom models. I don’t use these, so I set both to zero. Just to be safe.

There’s “Edit Quotas” to the top right. Which didn’t work, probably because I did this during the trial period, so quotas are meaningless, and apparently disabled anyhow (or more precisely, enabled to fixed limits).

So the way to do it was somewhat tricky (as it’s probably pointless): To enable a quota, right-click the “Cloud Translation API” to the left of the quota item, and open it in a new tab. Set up the quota figure there. But this description on how to do it might not be accurate for a real-life use. Actually, the system ignored my attempts to impose limits. They appeared on the page for editing them, but not on the main page.

Supporting CJK in LaTeX

I’m wrapping up this post with notes on how to feed LaTeX (pdflatex, more precisely) with Chinese, Japanese and Korean, with UTF-8 encoding, and get a hopefully reasonable result.

So first grab a few packages:

# apt install texlive-lang-european
# apt install texlive-lang-chinese
# apt install texlive-lang-korean
# apt install texlive-cjk-all

Actually, texlive-lang-european isn’t related, but as its name implies, it’s useful for European languages.

I first attempted with

\usepackage[UTF8]{ctex}

but pdflatex failed miserably with an error saying that the fontset ‘fandol’ is unavailable in current mode, whatever that means. After trying a few options back and forth, I eventually went for the rather hacky solution of using CJKutf8. The problem is that CJK chars are allowed only within

\begin{CJK}{UTF8}{gbsn}

[ ... ]

\end{CJK}

but I want it on the whole document, and I need the language setting to be made in a file that is included by the main LaTeX file (a different included file for each language). So I went for this simple hack:

\AtBeginDocument{\begin{CJK}{UTF8}{gbsn}}
\AtEndDocument{\end{CJK}}

As for the font, it appears like gbsn or gkai fonts should be used with Simplified Chinese, and bsmi or bkai with Traditional Chinese. Since I translated into Simplified Chinese, some characters just vanished from the output document when trying bsmi and bkai. The back-translation to English of a document made with bsmi was significantly worse, so these dropped characters had a clear impact on the intelligibility of the Chinese text.

I got this LaTeX warning saying

LaTeX Font Warning: Some font shapes were not available, defaults substituted.

no matter which of these fonts I chose, so it doesn’t mean much.

So the choice is between gbsn or gkai, but which one? To decide, I copy-pasted Chinese text from updated Chinese websites, and compared the outcome of LaTeX, based upon the TeX file shown below. It was quite clear that gbsn is closer to the fonts in use in these sites, even though I suspect it’s a bit of a Times New Roman: The fonts used on the web have fewer serifs than gbsn. So gbsn it is, even though it would have been nicer with a font with fewer serifs.

For Japanese, there are the “min”, “maru” and “goth” fonts. “Min” is a serif font, giving it a traditional look (calligraphy style) and judging from Japanese websites, it appears to be used primarily for logos and formal text (the welcoming words of a university’s president, for example).

“Maru” and “goth” are based upon simple lines, similar to plain text in Japanese websites. The latter is a bit of a bold version of “maru”, but it’s what seems to be popular. So I went with “goth”, which has a clean and simple appearance, similar to the vast majority of Japanese websites, even though the bold of “goth” can get a bit messy with densely drawn characters. It’s just that “maru” looks a bit thin compared to what is commonly preferred.

Korean has two fonts in theory, “mj” and “gt”. “mj” is a serif font with an old fashioned look, and “gt” is once again the plain, gothic version. I first failed to use the “gt” font even though it was clearly installed (there were a lot of files in the same directories as where the “mj” files were installed, only with “gt”). Nevertheless, trying the “gt” font instead of “mj” failed with

LaTeX Font Warning: Font shape `C70/gt/m/it' undefined
(Font)              using `C70/song/m/n' instead on input line 8.

! Undefined control sequence.
try@size@range ...extract@rangefontinfo font@info
                                                  <-*>@nil <@nnil

But as it turns out, it should be referred to as “nanumgt”, e.g.

\begin{CJK}{UTF8}{nanumgt}
나는 멋진 글꼴을 원한다
\end{CJK}

It’s worth mentioning XeLaTeX, which allows using an arbitrary TrueType font within LaTeX, so the font selection is less limited.

See this page on fonts in Japanese and Korean.

For these tests, I used the following LaTeX file for use with e.g.

$ pdflatex test.tex

\documentclass{hitec}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{CJKutf8}
\newcommand{\thetext}
{

它说什么并不重要，重要的是它是如何写的。
}

\AtBeginDocument{}
\AtEndDocument{}
\title{This document}
\begin{document}

gbsn:

\begin{CJK}{UTF8}{gbsn}
\thetext
\end{CJK}

gkai:

\begin{CJK}{UTF8}{gkai}
\thetext
\end{CJK}

bsmi:

\begin{CJK}{UTF8}{bsmi}
\thetext
\end{CJK}

bkai:

\begin{CJK}{UTF8}{bkai}
\thetext
\end{CJK}

\end{document}

Random notes on Perl Regular Expressions

eli — Sun, 10 Jul 2022 04:05:23 +0000

It’s 2022, Perl isn’t as popular as it used to be, and for a moment I questioned its relevance. Until I had a task requiring a lot of pattern matching, which reminded me why Perl is that loyal companion that always has an on-spot solution to whatever I need.

These are a few notes I took as I discovered the more advanced, and much-needed, features of Perl regexps.

If a regex is passed as a value generated by qr//, the modifiers of this qr// are significant. So e.g. if the match should be case-insensitive, add the i modifier to the qr// itself (as in qr/.../i).
Quantifiers can be used on regex groups, whether they capture or not. For example, \d+(?:\.\d+)+ means one or more digits followed by one or more patterns of a dot and one or more digits. Think BNF.
Complex regular expressions can be created relatively easily by breaking them down into smaller pieces and assigning each a variable with qr//. The complex expression becomes fairly readable this way. Almost needless to say, quantifiers can be applied on each of these subexpressions.

It’s possible to give capture elements names, e.g. $t =~ /^(?<pre>.*?)(?<found>[ \t\n]*${regex}[ \t\n]*)(?<post>.*)$/s. The capture results then appear in e.g. $+{pre}, $+{found} and $+{post}. This is useful in particular if the regex in the middle may have capture elements of its own, so the usual counting method doesn’t work.

Captured elements can be used in the regex itself, e.g. /([\'\"])(.*?)\1/ so \1 stands for either a single or double quote, whichever was found.
Even better, there’s e.g. \g{-1} instead of numeric grouping, which means the last capture group before it. Once again, useful in a regex that can be used in more complicated contexts.
When there are nested unnamed capture parentheses, the outer parenthesis gets the first capture number.
If there are several capture parentheses with a ‘|’ between them, all of them produce a capture position, but those that weren’t in use for matching get undef.
(?:…) grouping can be followed by a quantifier, so this makes perfect sense ((?:[^\\\{\}]|\\\\|\\\{|\\\})*) for any number of characters that aren’t a backslash or a curly bracket, or any of these preceded by an escaping backslash.
Quantifiers can be super-greedy (possessive, officially) in the sense that they don’t allow backtracking. So e.g. /a++b/ matches exactly like /a+b/, but with the former the computer won’t attempt to consume fewer a’s (if such are found) in order to try to find a “b”. In this case, it’s just an optimization for speed. All of these extra-greedy quantifiers are made with an extra plus sign.
There’s lookbehind and lookahead assertions, which are really great. In particular, the negative assertions. E.g. /(?
Lookaheads and lookbehinds also work inside grouping parentheses (whether capturing or not), as grouping is treated as an independent regex.
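
Several of the notes above can be tried out with a short script. This is just a sketch for playing around: The function names, the regexes and the test strings are all made up for the sake of the example.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Small building blocks made with qr//. The /i modifier stays with
# $word even when it's interpolated into a regex without /i.
my $word = qr/version/i;
my $num = qr/\d+(?:\.\d+)+/; # Digits, then one or more ".digits"

# Named captures: The results are in %+, no matter how many capture
# groups $regex has on its own.
sub split_around {
  my ($t, $regex) = @_;

  return unless
    ($t =~ /^(?<pre>.*?)(?<found>[ \t\n]*${regex}[ \t\n]*)(?<post>.*)$/s);

  return ($+{pre}, $+{found}, $+{post});
}

# Relative backreference: \g{-2} is the second-last group opened before
# it, i.e. the quote character, wherever $quoted ends up being used.
my $quoted = qr/([\'\"])(.*?)\g{-2}/;

sub key_value {
  my ($t) = @_;
  my ($key, $quote, $value) = ($t =~ /(\w+)=$quoted/);

  return ($key, $value);
}

# Negative lookbehind: Count curly brackets that aren't escaped
sub unescaped_braces {
  my ($t) = @_;
  my @found = ($t =~ /(?<!\\)[\{\}]/g);

  return scalar @found;
}

my ($pre, $found, $post) =
  split_around("Perl VERSION 5.30.0 rocks", qr/$word $num/);
print "pre=[$pre] found=[$found] post=[$post]\n";

my ($k, $v) = key_value(q{name='it"s' other="x"});
print "$k => $v\n"; # name => it"s

print unescaped_braces('a{b\{c}'), " unescaped brackets\n"; # 2

# Possessive quantifier: a++ never gives back an "a" it has consumed
print "a++b matches aaab\n" if ("aaab" =~ /a++b/);
print "a++a never matches aaa\n" unless ("aaa" =~ /a++a/);
```

The last line shows that a possessive quantifier isn’t always just an optimization: /a+a/ matches “aaa” by backtracking, but /a++a/ can’t.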

Thunderbird: Upgrade notes

eli — Fri, 10 Jun 2022 15:16:22 +0000

Introduction

These are my notes as I upgraded Thunderbird from version 3.0.7 (released September 2010) to 91.10.0 on Linux Mint 19. That’s a gap of more than ten years, which says something about what I think about upgrading software (which was somewhat justified, given the rubbish issues that arose, as detailed below). What eventually forced me to do this was the need to support OAuth2 in order to send emails through Google’s Gmail server (supported since 91.8.0).

Thunderbird is essentially a Firefox browser which happens to be set up with a GUI that processes emails. So for example, the classic menubar is hidden, but can be revealed by pressing Alt.

Using the correct profile

When attempting to run a new version of Thunderbird, be sure to rename ~/.thunderbird into something else, or else the current profile will be upgraded right away. With some luck, the suffixes (e.g. -release) might make Thunderbird ignore the old information, but don’t trust that.

Actually, it seems like this is handled gracefully anyhow. When I installed exactly the same version in a different location on the disk, it ignored the profile with -release suffix, and added one with -release-1. So go figure.

To select which profile to work with, invoke Thunderbird’s Profile Manager with

$ thunderbird -profilemanager &

For making the upgrade, first make a backup tarball from the original profile directory.

To adopt it into the new version of Thunderbird, invoke the Profile Manager and pick Create Profile…, create a new directory (I called it “mainprofile”), and pick that as the place for the new profile. Launch Thunderbird, quit right away, and then delete the new directory. Rename the old directory with the new deleted directory’s name. Then launch Thunderbird again.

Add-ons

Previously, I had the following add-ons:

BiDi Mail UI (apparently still necessary)
Clippings. Just import the previous clippings from ~/clipdat2.rdf. Unlike the old version, the data is kept in a database file inside the profile, so the old file can be deleted.
Gnome Integration Options for calling a command on mail arrival. It was deprecated. So I went for Mailbox Alert, which allows adding specific actions: Sound, a message and/or command. With mail folder granularity, in fact.
Mail Tweak. It’s really really old, and probably unnecessary since long.
Outgoing Message Format (for text vs. HTML messages). Deprecated since long, as these options are integrated into Thunderbird itself.

So I remained with the first two only.

Installing Thunderbird

The simplest Thunderbird installation involves downloading it from their website and extracting the tarball somewhere in the user’s own directories. For a proper installation, I installed it under /usr/local/bin/ with

# tar -C /usr/local/bin -xjvf thunderbird-91.10.0.tar.bz2

as root. And then reorganize it slightly:

# cd /usr/local/bin
# mv thunderbird thunderbird-91.10.0
# ln -s thunderbird-91.10.0/thunderbird

Composing HTML messages

Right-click the account at the left bar, pick Settings and select the Composition & Addressing item. Make sure Compose messages in HTML is unchecked: Messages should be composed as plain text by default.

Then go through each of the mail identities and verify that Compose messages in HTML is unchecked under the Composition & Addressing tab.

However if Shift is pressed along with clicking Write, Reply or whatever for composing a new message, Thunderbird opens it as HTML.

Recover old contacts

Thunderbird went from the old *.mab format to SQLite for keeping the address books. So go Tools > Import… > Pick Address Books… and pick Mork Database, and from there pick abook.mab (and possibly repeat this with history.mab, but I skipped this, because it’s too much).

Silencing update notices

Thunderbird, like most software nowadays, wants to update itself automatically, because who cares if something goes wrong all of a sudden as long as the latest version is installed.

I messed around with this for quite long until I found the solution. So I’m leaving everything I did written here, but it’s probably enough with just adding policies.json, as suggested below.

So to the whole story (which you probably want to skip): Under Preferences > General > Updates I selected “check for updates” rather than install automatically (it can’t anyhow, since I’ve installed Thunderbird as root), but then it starts nagging that there are updates.

So it’s down to setting the application properties manually by going to Preferences > General > Config Editor… (button at the bottom).

I changed app.update.promptWaitTime to 31536000 (365 days) but that didn’t have any effect. So I added an app.update.silent property and set it true, but that didn’t solve the problem either. So the next step was to change app.update.staging.enabled to false, and that did the trick. Well, almost. With this, Thunderbird didn’t issue a notification, but its tab on the system tray gets focus every day. Passive aggressive.

As a side note, there are other suggestions I’ve encountered out there: To change app.update.url so that Thunderbird doesn’t know where to look for updates, or set app.update.doorhanger false. Haven’t tried either.

So what actually worked: Create a policies.json in /usr/local/bin/thunderbird/distribution/, with “DisableAppUpdate”: true, that is:

{
 "policies": {
  "DisableAppUpdate": true
 }
}

Note that the “distribution” directory must be in the same directory as the actual executable for Thunderbird (that is, follow the symbolic link if such exists). In my case, I had to add this directory myself, because of a manual installation.

And, as suggested on this page, the successful deployment can be verified by restarting Thunderbird, and then looking at Help > About inside Thunderbird, which now shows a comment on updates being disabled.

In hindsight, I can speculate on why this works: The authors of Thunderbird really don’t want us to turn off automatic updates, mainly because if people start running outdated software, that increases the chance of a widespread attack on some vulnerability, which can damage the software’s reputation. So Thunderbird is designed to ignore previous possibilities to turn the update off.

There’s only one case where there’s no choice: If Thunderbird was installed by the distribution. In this case, it’s installed as root, so it can’t be updated by a plain user. Hence it’s the distribution’s role to nag. And it has the same interest to nag about upgrades (reputation and that).

So I guess that’s why Thunderbird respects this JSON file only.

Folders with new mails in red

Exactly like 10 years ago, the trick is to create a “chrome” directory under .thunderbird/ and then add the following file:

$ cat ~/.thunderbird/sdf2k45i.default/chrome/userChrome.css
@namespace
url("http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul"); /* set default namespace to XUL */

/* Setting the color of folders containing new messages to red */

treechildren::-moz-tree-cell-text(folderNameCol, newMessages-true) {
 font-weight: bold;
 color: red !important;
}

But unlike old Thunderbird, this file isn’t read by default. So to fix that, go to Preferences > General > Config Editor… (button at the bottom) and there change toolkit.legacyUserProfileCustomizations.stylesheets to true.

New mail icon in system tray

Thunderbird sends a regular notification when a new mail arrives, but exactly like last time, I want a dedicated icon that is dismissed only when I click it. The rationale is to be able to see if a new mail has arrived at a quick glance of the system tray. Neither zenity --notification nor notify-send were good for this, since they send the common notification (zenity used to just add an icon, but it “got better”).

But then there’s yad. I began with “apt install yad”, but that gave me a really old version that distorted the icon in the system bar. So I installed it from the git repository’s tag 1.0. I first attempted v12.0, but I ended up with the problem mentioned here, and didn’t want to mess around with it more.

Its “make install” adds /usr/local/bin/yad, as well as a lot of yad.mo under /usr/local/share/locale/*, a lot of yad.png under /usr/local/share/icons/*, yad.m4 under /usr/local/share/aclocal/ and yad.1 + pfd.1 in /usr/local/share/man/man1. So quite a lot of files, but in a sensible way.

With this done, the following script is kept (as executable) as /usr/local/bin/new-mail-icon:

#!/usr/bin/perl
use warnings;
use strict;
use Fcntl qw[ :flock ];

my $THEDIR="$ENV{HOME}/.thunderbird";
my $ICON="$THEDIR/green-mail-unread.png";

my $NOW=scalar localtime;

open(my $fh, "<", "$ICON")
  or die "Can't open $ICON for read: $!";

# Lock the file. If it's already locked, the icon is already
# in the tray, so fail silently (and don't block).

flock($fh, LOCK_EX | LOCK_NB) or exit 0;

fork() && exit 0; # Only child continues

system('yad', '--notification', "--text=New mail on $NOW", "--image=$ICON", '--icon-size=32');

This script is the improved version of the previous one, and it prevents multiple icons in the tray much better: It locks the icon file exclusively and without blocking. Hence if there’s any other process that shows the icon, subsequent attempts to lock this file fail immediately.

Since the “yad” call takes a second or two, the script forks and exits before that, so it doesn’t delay Thunderbird’s machinery.
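
The locking trick can be tested without yad or Thunderbird. The sketch below uses a temporary file as a stand-in for the icon file, and a made-up helper function. It assumes Linux, where Perl’s flock() relies on the flock() system call, so two separate opens of the same file compete for the lock even within the same process.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Fcntl qw[ :flock ];
use File::Temp qw[ tempfile ];

# Hypothetical helper: Open the file and try to lock it without
# blocking. Returns the file handle on success (the lock is held as
# long as the handle is open), or undef if the file is locked already.
sub try_lock {
  my ($fname) = @_;

  open(my $fh, "<", $fname)
    or die "Can't open $fname for read: $!\n";

  return flock($fh, LOCK_EX | LOCK_NB) ? $fh : undef;
}

# Stand-in for the icon file. Deleted when the script exits.
my (undef, $fname) = tempfile(UNLINK => 1);

my $first = try_lock($fname);
my $second = try_lock($fname);

print "First attempt: ", (defined $first ? "locked" : "busy"), "\n";
print "Second attempt: ", (defined $second ? "locked" : "busy"), "\n";

close $first; # Releases the lock

my $third = try_lock($fname);
print "After release: ", (defined $third ? "locked" : "busy"), "\n";
```

On Linux, this is expected to say locked, busy and then locked again. In the real script, the lock is released when the process that runs yad exits, which is when the icon has been clicked.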

With this script in place, the Mailbox Alert is configured as follows. Add a new item to the list as in this dialog box:

The sound should be set to a WAV file of choice.

Then right-click the mail folder to have covered (Local Folders in my case), pick Mailbox Alert and enable “New Mail” and “Alert for child folders”.

Then right-click “Inbox” under this folder, and verify that nothing is checked for Mailbox Alert for it (in particular not “Default sound”). The exceptions are the Outbox and Drafts folders, for which “Don’t let parent folders alert for this one” should be checked, or else there’s a false alarm on autosaving and when using “send later”.

Later on, I changed my mind and added a message popup, so now all three checkboxes are ticked, and the Message tab reads:

I picked the icon as /usr/local/bin/thunderbird-91.10.0/chrome/icons/default/default32.png (this depends on the installation path, of course).

I’m not 100% clear why the original alert didn’t show up, even though “Show an alert” was still checked under “Incoming Mails” at Preferences > General. I actually preferred the good old one, but it seems like Mailbox Alert muted it. I unchecked it anyhow, just to be safe.

Refusing to remember passwords + failing to send through Gmail

It’s not a real upgrade if a weird problem doesn’t occur out of the blue.

So attempting to Get Messages from pop3 server at localhost failed quite oddly: Every time I checked the box to use Password Manager to remember the password, it got stuck with “Main: Connected to 127.0.0.1…”. But checking with Wireshark, it turned out that Thunderbird asked the server about its capabilities (CAPA), got an answer and then did nothing for about 10 seconds, after which it closed the connection.

On the other hand, when I didn’t request remembering the password, it went fine, and so did subsequent attempts to fetch mail from the pop3 server.

Another thing was that when attempting to use Gmail’s server, I went through the entire OAuth2 thing (the browser window, and asking for my permissions) but then the mail was just stuck on “Sending message”. Like, forever.

So I followed the advice here, and deleted key3.db, key4.db, secmod.db, cert*.db and all signon* files (with Thunderbird not running, of course). Really old stuff.

And that fixed it.

The files that were apparently created when things got fine were logins.json, cert9.db, key4.db and pkcs11.txt. But I might have missed something.
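For the record, the cleanup above could be sketched in Perl roughly as follows. This is only a sketch under assumptions that aren't in the original procedure: the files are moved to a backup directory rather than deleted, both directories are given on the command line and already exist, and the function name stash_key_files is made up for illustration. Thunderbird must not be running.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy qw(move);

# The names listed above: key3.db, key4.db, secmod.db, cert*.db, signon*
my $stale = qr/^(?:key[34]\.db|secmod\.db|cert.*\.db|signon.*)$/;

# Move the matching plain files from $profile to $backup.
# Returns the sorted list of file names that were moved.
# (Assumption: moving instead of deleting, so the step can be undone.)
sub stash_key_files {
  my ($profile, $backup) = @_;

  opendir(my $dh, $profile) or die "Can't open $profile: $!\n";
  my @names = sort grep { /$stale/ && -f "$profile/$_" } readdir($dh);
  closedir($dh);

  foreach my $name (@names) {
    move("$profile/$name", "$backup/$name")
      or die "Can't move $name: $!\n";
  }

  return @names;
}

# Usage: ./stash.pl <profile directory> <existing backup directory>
if (@ARGV == 2) {
  my @moved = stash_key_files(@ARGV);
  print "Moved: @moved\n";
}
```

Moving the files aside means they can be put back if the cleanup turns out not to help.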

The GUI got stuck for a few seconds every now and then

This happened occasionally when I navigated from one mail folder to another. The solution I found somewhere was to delete all .msf files from where Thunderbird keeps the mail info, and that did the trick. Ehm, just for a while. After a few days, it was back.

As a side effect, it forgot the display settings for each folder, i.e. which columns to show and in what order.

These .msf files are apparently indexes to the files containing the actual messages, and indeed it took a few seconds before something appeared when I went to view each mail folder for the first time. That's when the new .msf files grew from zero bytes to a significant size.
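To see which index files are involved before removing anything, something like the following could be used. It's a minimal sketch: the mail directory is an assumption to be supplied on the command line, the function name find_msf is made up, and the script only lists the files with their sizes, leaving the deletion itself to be done by hand.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Return the paths of all plain *.msf files below $top, sorted.
sub find_msf {
  my ($top) = @_;
  my @found;

  # File::Find changes into each directory, so $_ is the bare file name
  find(sub { push @found, $File::Find::name if (-f $_ && /\.msf$/); }, $top);

  my @sorted = sort @found;
  return @sorted;
}

# Usage: ./list-msf.pl <directory where Thunderbird keeps the mail>
if (@ARGV == 1) {
  foreach my $f (find_msf($ARGV[0])) {
    printf "%10d %s\n", (-s $f) // 0, $f;
  }
}
```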

Since the problem remained, I watched “top” when the GUI got stuck. And indeed, Thunderbird’s process was at 100%, but so was a completely different process: caribou. Which is a virtual keyboard. Do I need one? No. So to get rid of this process (which runs all the time, but doesn’t eat a lot of CPU normally), go to Accessibility settings, the Keyboard tab and turn “Enable the on-screen keyboard” off. The process is gone, and so is the problem with the GUI? Nope. It’s basically the same, but instead of two processes taking 100% CPU, now it’s Thunderbird alone. I have no idea what to do next.

Perl: Matching apparently plain space in HTML with regular expression

eli — Wed, 05 Jan 2022 14:21:31 +0000

I’ve been using a plain space character in Perl regular expressions for ages, and it has always worked. Something like this for finding double spaces:

my @doubles = ($t =~ / {2,}/g);

or for emphasis on the space character, equivalently:

my @doubles = ($t =~ /[ ]{2,}/g);

but then I began processing HTML representation from the Mojo::DOM module (or TinyMCE’s output directly) and this just didn’t work. That is, \s detected the spaces (with Perl 5.26) but the plain space character didn’t.

As it turns out, TinyMCE put &nbsp; instead of the first space (when there was a pair of them), which Mojo::DOM correctly translated to the 0xa0 Unicode character (0xc2, 0xa0 in UTF-8). Hence there's no chance that a plain space, i.e. a 0x20, will match it. Perl was clever enough to match it as a whitespace (with \s).
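A quick way to see what is really in such a string is to dump the code point of each character. Here is a minimal sketch, assuming the text arrives as UTF-8 bytes; the sample string and the helper name code_points are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Return the code point of each character in $s, as hex strings.
sub code_points {
  my ($s) = @_;
  return map { sprintf("0x%02x", ord($_)) } split(//, $s);
}

# Hypothetical sample: "x", a non-breaking space as UTF-8 bytes (0xc2, 0xa0),
# a plain space and "y"
my $sample = decode('UTF-8', "x\xc2\xa0 y");

print join(' ', code_points($sample)), "\n"; # prints: 0x78 0xa0 0x20 0x79
```

The two bytes 0xc2, 0xa0 collapse into the single character 0xa0 once decoded, and that's the one standing where a 0x20 was expected.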

Solution: Simple. Just go

my @doubles = ($t =~ /[ \xa0]{2,}/g);

In other words, match either the good old space or the non-breaking space.
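To illustrate the difference between the two patterns, here's a small self-contained sketch. The sample string is made up, and space_runs is just a hypothetical wrapper around the regular expression above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect runs of two or more spaces, treating 0xa0 as a space too.
sub space_runs {
  my ($text) = @_;
  my @runs = ($text =~ /[ \xa0]{2,}/g);
  return @runs;
}

# Hypothetical sample: 0xa0 + plain space after "a", two plain spaces after "b"
my $text = "a\x{a0} b  c";

my @plain = ($text =~ / {2,}/g);   # plain spaces only
my @both  = space_runs($text);     # plain spaces and 0xa0

printf "plain space only: %d, with 0xa0: %d\n", scalar(@plain), scalar(@both);
# prints: plain space only: 1, with 0xa0: 2
```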