<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>my tech blog &#187; Internet</title>
	<atom:link href="http://billauer.se/blog/category/internet/feed/" rel="self" type="application/rss+xml" />
	<link>https://billauer.se/blog</link>
	<description>Anything I found worthy to write down.</description>
	<lastBuildDate>Thu, 12 Mar 2026 11:36:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.2</generator>
		<item>
		<title>Perl one-liner for adding newlines to HTML</title>
		<link>https://billauer.se/blog/2026/03/perl-tidy-html/</link>
		<comments>https://billauer.se/blog/2026/03/perl-tidy-html/#comments</comments>
		<pubDate>Thu, 12 Mar 2026 11:35:32 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[Rich text editors]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=7235</guid>
		<description><![CDATA[When the rich editor puts all HTML in one line, and I want to edit it, I could always use the &#8220;tidy&#8221; utility, however it does too much. All I want is a newline here and there to make the whole thing accessible. So this simple one-liner does the job: perl -pe 's/(&#60;\/(?:p&#124;h\d&#124;div&#124;tr&#124;td&#124;table&#124;ul&#124;ol&#124;li)&#62;)/"$1\n"/ge' Not perfect, [...]]]></description>
			<content:encoded><![CDATA[<p>When the rich editor puts all of the HTML on one line and I want to edit it, I could always use the &#8220;tidy&#8221; utility, but it does too much. All I want is a newline here and there to make the whole thing accessible.</p>
<p>So this simple one-liner does the job:</p>
<pre>perl -pe 's/(&lt;\/(?:p|h\d|div|tr|td|table|ul|ol|li)&gt;)/"$1\n"/ge'</pre>
<p>Not perfect, but gives something to work with.</p>
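<p>For example, to run it on a file (the file names here are just placeholders):</p>
<pre>perl -pe 's/(&lt;\/(?:p|h\d|div|tr|td|table|ul|ol|li)&gt;)/"$1\n"/ge' messy.html &gt; readable.html</pre>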
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2026/03/perl-tidy-html/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Altering the Message-ID header in Thunderbird for non-spam detection</title>
		<link>https://billauer.se/blog/2024/08/mail-message-id-spam-filtering/</link>
		<comments>https://billauer.se/blog/2024/08/mail-message-id-spam-filtering/#comments</comments>
		<pubDate>Sat, 10 Aug 2024 10:38:18 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[email]]></category>
		<category><![CDATA[Internet]]></category>
		<category><![CDATA[Server admin]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=7125</guid>
		<description><![CDATA[TL;DR In this post, I suggest manipulating the Message IDs of outgoing mails, so that legit inbound replies to my mails are easily detected as non-spam. I also show how to do this with Thunderbird (Linux version 91.10.0, but it works with practically all versions, I believe). Briefly about Message-ID Each email should have a [...]]]></description>
			<content:encoded><![CDATA[<h3>TL;DR</h3>
<p>In this post, I suggest manipulating the Message IDs of outgoing mails, so that legit inbound replies to my mails are easily detected as non-spam. I also show how to do this with Thunderbird (Linux version 91.10.0, but it works with practically all versions, I believe).</p>
<h3>Briefly about Message-ID</h3>
<p>Each email should have a Message-ID header, which uniquely identifies this message. The value of this header <a rel="noopener" href="https://www.rfc-editor.org/rfc/rfc2822#section-3.6.4" target="_blank">should consist</a> of a random string, followed by an &#8216;@&#8217; and a string that represents the domain name (referred to as FQDN, Fully Qualified Domain Name). This is often the full domain name of the &#8220;From&#8221; header (e.g. gmail.com).</p>
<p>For example, an email generated by Gmail&#8217;s web client had Message-ID: &lt;CAD8P7-R2OuJvGiuQ-0RQqgSSmDguwv1VdjHgQND4jMJxPc628w@mail.gmail.com&gt;. A similar result (same FQDN) was obtained when sending from the phone. However, when using Thunderbird to send an email, only &#8220;gmail.com&#8221; was set as the FQDN.</p>
<h3>Does the Message-ID matter?</h3>
<p>Like anything related to email, there are a lot of actors, and each has its own quirks. For example, rspamd raises the spam score by 0.5, through its MID_RHS_NOT_FQDN rule, if the right-hand side of the Message ID isn&#8217;t an FQDN. I&#8217;m not sure to what extent it checks that the FQDN matches the email&#8217;s From header, but even if it does, it can&#8217;t be that picky, given the example I showed above in relation to gmail.com.</p>
<p>It&#8217;s quite rare that people care about this header. I&#8217;ve <a rel="noopener" href="https://forum.emclient.com/t/message-id-contains-local-computer-name-can-it-be-changed/68828" target="_blank">seen somewhere</a> that someone sending mails from a work computer didn&#8217;t like the name of the internal domain leaking.</p>
<p>All in all, it&#8217;s probably a good idea to make sure that the Message-ID header looks legit. Putting the domain from the From header seems to be a good idea to keep spam filters happy.</p>
<h3>Why manipulate the Message-ID?</h3>
<p>In a reply, the In-Reply-To header gets the value of the Message ID of the message replied to. So if a spam filter can identify that the email is genuinely a reply to something I sent, it&#8217;s definitely not spam. It&#8217;s also a good idea to scan the References header too, in order to cover more elaborate scenarios where several people are corresponding.</p>
<p>The rigorous way to implement this spam filtering feature is to store the Message IDs of all sent mails in some small database, and check for a match with the content of the In-Reply-To header of arriving mails. Possible, however daunting.</p>
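<p>As a minimal sketch of that database approach, with a plain text file standing in for the database (the file name and the Message-ID here are made up for this example):</p>
<pre># when sending: record the Message-ID of each outgoing mail
echo '&lt;random123@secretsauce.gmail.com&gt;' &gt;&gt; ~/.sent-message-ids

# when receiving: an In-Reply-To that matches a recorded ID
# is almost certainly a genuine reply
grep -qxF '&lt;random123@secretsauce.gmail.com&gt;' ~/.sent-message-ids &amp;&amp; echo "genuine reply"</pre>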
<p>A much easier way is to change the FQDN part, so that it&#8217;s easily identifiable. This is unnecessary if you happen to send emails with your own domain, as spam senders are very unlikely to add an In-Reply-To with a matching domain (actually, very few spam messages have an In-Reply-To header at all).</p>
<p>But for email sent through gmail, changing the FQDN to something unique is required to make a distinction.</p>
<p>Will this mess up things? I&#8217;m not sure any software tries to fully match the FQDN with the sender, but I suppose it&#8217;s safe to add a subdomain to the correct domain. I mean, if both &#8220;mail.gmail.com&#8221; and &#8220;gmail.com&#8221; are commonly out there, why shouldn&#8217;t &#8220;secretsauce.gmail.com&#8221; seem likewise legit to any spam filter that checks the message?</p>
<p>And by the way, as of August 2024, a DNS query for mail.gmail.com yields no address, neither for A nor MX. In other words, Gmail itself uses an invalid domain in its Message ID, so any other invented subdomain should do as well.</p>
<h3>Changing the FQDN on Thunderbird</h3>
<p>Click the hamburger icon, choose Preferences, and scroll down all the way (on the General tab) and click on Config Editor.</p>
<p>First, we need to find Thunderbird&#8217;s internal ID number for the mail account to manipulate.</p>
<p>To get a list of IDs, write &#8220;useremail&#8221; in the search text box. This lists entries like mail.identity.id1.useremail and their values. This listing allows making the connection between e.g. &#8220;id1&#8221; and the email address related to it.</p>
<p>For example, to change the FQDN of the mail account corresponding to &#8220;id3&#8221;, add a string property (using the Config Editor). The key of this property is &#8220;mail.identity.id3.FQDN&#8221; and the value is something like &#8220;secretsauce.gmail.com&#8221;.</p>
<p>There is no need to restart Thunderbird. The change is in effect on the next mail sent, and it remains in the settings across restarts.</p>
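<p>For reference, this is how the setting ends up in the profile&#8217;s prefs.js (with the example id and domain from above; edit this file directly only while Thunderbird isn&#8217;t running):</p>
<pre>user_pref("mail.identity.id3.FQDN", "secretsauce.gmail.com");</pre>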
<p>The need for this feature has been questioned, as was discussed <a rel="noopener" href="https://thunderbird.topicbox.com/groups/planning/T53e9959c6d6c12a9-M030a61c790caab2d7763a25b" target="_blank">here</a>. So if any Thunderbird maintainer reads this, please keep this feature up and running.</p>
<h3>A possible alternative approach</h3>
<p>Instead of playing around with the Message-ID, it would be possible to add an entry to the References header (or add this header if there is none). The advantage of this approach is that it can also be done by the MTA further down the delivery path, and it doesn&#8217;t alter anything that is already in place.</p>
<p>And since it&#8217;s an added entry, it can also be crafted arbitrarily. For example, it may contain a timestamp (epoch time in hex) and the SHA1 sum of a string that is composed of this timestamp and a secret string. This way, this proof of genuine correspondence is impossible to forge and may expire with time.</p>
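<p>Here&#8217;s a quick sketch of generating such an entry in shell (the secret string and the domain are made up for illustration; sha1sum comes with coreutils):</p>
<pre>secret="some-long-private-string"       # known only to my own mail setup
ts=$(printf '%x' "$(date +%s)")         # epoch time in hex
sig=$(printf '%s%s' "$ts" "$secret" | sha1sum | cut -d' ' -f1)
echo "&lt;$ts.$sig@token.example.com&gt;"    # the entry to append to References</pre>
<p>An inbound filter recomputes the SHA1 from the timestamp and the secret, verifies the match, and can also reject tokens whose timestamp is too old.</p>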
<p>I haven&#8217;t looked into how to implement this in Thunderbird. Right now I&#8217;m good with the Message-ID solution.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2024/08/mail-message-id-spam-filtering/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ssl.com stealing from my credit card, again</title>
		<link>https://billauer.se/blog/2024/05/ssl-com-renewal-fraud/</link>
		<comments>https://billauer.se/blog/2024/05/ssl-com-renewal-fraud/#comments</comments>
		<pubDate>Mon, 13 May 2024 04:16:04 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[crypto]]></category>
		<category><![CDATA[Internet]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=7064</guid>
		<description><![CDATA[Credit card abuse, episode #2 ssl.com presents the lowest price for an EV code signing certificate, however it&#8217;s a bit like going into a flea market with a lot of pickpockets around: Pay attention to your wallet, or things happen. This is a follow-up post to one that I wrote three years ago, after ssl.com [...]]]></description>
			<content:encoded><![CDATA[<h3>Credit card abuse, episode #2</h3>
<p>ssl.com presents the lowest price for an EV code signing certificate, however it&#8217;s a bit like going into a flea market with a lot of pickpockets around: Pay attention to your wallet, or things happen.</p>
<p>This is a follow-up post to <a rel="noopener" href="https://billauer.se/blog/2021/11/esigner-cloud-signing-ssl-com-certificate/" target="_blank">one that I wrote three years ago</a>, after ssl.com suddenly charged my credit card in relation to the eSigner service. It turns out that this is a working pattern that persists even three years later. Actually, it was $200 last time, and $747 now, so one could say they&#8217;re improving.</p>
<h3>The autorenew fraud</h3>
<p>Three years ago, I got an EV code signing certificate from ssl.com that expired more or less at the time of writing this. I got a reminder email from ssl.com, urging me to renew this certificate, and indeed, I ordered a one-year certificate so I could continue to sign drivers. I paid for the one-year certificate, went through a brief process of authenticating my identity, and got an approval soon enough.</p>
<p>I&#8217;ll say a few words below about the technicalities around getting the certificate, but all in all the process was finished after a few days, and I thought that was the end of it.</p>
<p>And then, I randomly checked my credit card bill and noticed that it had been charged with 747 USD by ssl.com. So I contacted them to figure out what happened. The answer I got was:</p>
<blockquote><p><span class="yadayada">{order number}</span> is an auto renewal for the expiring order. But, I do see that you already manually renewed and renewal cert issued.</p>
<p>I can cancel <span class="yadayada">{order number}</span> then credit the amount to your SSL.com account. Would that be good with you?</p></blockquote>
<p>Indeed, the automatic renewal order was issued after I had completed the process with the new certificate, so surely there was no excuse for an automatic renewal. And the offer to add the funds to my account in ssl.com for future use was of course a joke (even though they were serious about it, of course).</p>
<p>It&#8217;s worth mentioning that the reminder email said nothing about what would happen if I didn&#8217;t renew the certificate. And surely, there was no hint about any automatic mechanism for a renewal.</p>
<p>On top of that, I got no notification whatsoever about the automatic renewal or that my credit card had been charged. Needless to say, I didn&#8217;t approve this renewal. In fact, I made the order for the one-year certificate on a different and temporary credit card, because I learned the lesson from three years ago. Or so I thought.</p>
<p>So I asked them to cancel the order and refund my credit card. Basically, the answer I got was</p>
<blockquote><p>I have forwarded to the billing team about the refund request. They will email you once they have an update.</p></blockquote>
<p>Sounds like a fairly happy end, doesn&#8217;t it? Only they didn&#8217;t cancel the order, let alone refund the credit card. Over the course of two weeks, I sent three reminders, and the answer was repeatedly that my requests and reminders had been forwarded to &#8220;the team&#8221;, and that&#8217;s where it ended. Who knows, maybe I&#8217;ll just forget about it.</p>
<p>I sent the fourth reminder to billing@ssl.com (and not support@ssl.com), so I got some kind of response. Once again, I was offered the option of filling up my wallet on ssl.com with the money instead of a refund. To which I responded negatively, of course. In fact, I turned to slightly harsher language, saying that ssl.com&#8217;s conduct makes them no better than a street pickpocket.</p>
<p>And interestingly enough, the response was that my refund request &#8220;had been approved&#8221;. A day later, I got a confirmation that a refund had taken place. The relevant order remained in the ssl.com&#8217;s Dashboard as &#8220;pending validation&#8221;, but at the same time also marked as refunded. And indeed, the refund was visible in my credit card bill the day after that.</p>
<p>So the method is to fetch money silently from the credit card, hoping that I won&#8217;t pay attention or won&#8217;t bother to do anything about it. Is there another definition for stealing? And I guess this method works fine with companies that have a lot of transactions of this sort with their credit cards. A few hundred dollars can easily slip away.</p>
<p>It appears like the counter-tactic is to use angry and ugly language right away. As long as the request for refund is polite and sounds like a calm person has written it, there&#8217;s still hope that the person writing it will give up or maybe forget about it.</p>
<p>And by the way, this post was published after receiving the refund, so unlike last time, it didn&#8217;t play a role in getting the issue resolved.</p>
<h3>Avoiding unexpected withdrawals</h3>
<p>The best way to avoid situations like this is of course to use a credit card with a short life span. This is the kind I used this time, but not three years ago.</p>
<p>Specifically with ssl.com, there are two things to do when ordering a certificate from them:</p>
<ul>
<li>After purchasing, be sure that autorenewal is off. Click the &#8220;settings&#8221; link on the Dashboard, and uncheck &#8220;Automated Certificate Renewal&#8221;.</li>
<li>Also, delete the credit card details: Click on &#8220;deposit funds&#8221; on the Dashboard, and delete the credit card details.</li>
</ul>
<h3>Pushing eSigner, again</h3>
<p>And now to a more subtle issue.</p>
<p>The approval for my one-year certificate came quickly enough, and it came with two suggestions for continuing: To start off immediately with eSigner, or to order a Yubikey from them with the certificate already loaded on it. The latter option costs $279 (even though it was included for free three years ago). Makes the eSigner option sound attractive, doesn&#8217;t it?</p>
<p>They didn&#8217;t mention using the Yubikey dongle that I already had and that I used for signing drivers. It was only when I asked about this option that they responded that there&#8217;s <a rel="noopener" href="https://www.ssl.com/how-to/key-generation-and-attestation-with-yubikey/" target="_blank">a procedure</a> for loading a new certificate into the existing dongle.</p>
<p>And so I did, and filled in the automatic form on their website, as required for obtaining a Yubikey-based certificate. And waited. And waited. Nothing happened. So I sent a reminder, got apologies for the delay, and finally got the certificate I had ordered.</p>
<p>Was this an innocent mishap, or a deliberate attempt to make me try out eSigner instead? As I&#8217;ve already <a rel="noopener" href="https://billauer.se/blog/2021/11/esigner-cloud-signing-ssl-com-certificate/" target="_blank">had my fingers burnt with eSigner</a>, no chance I would do that, but I can definitely imagine people losing their patience.</p>
<h3>The Yubikey dongle costs $25</h3>
<p>You can get your Yubikey dongle from ssl.com at $279, or buy it directly from Yubico at $25. <a rel="noopener" href="https://www.yubico.com/il/product/security-key-series/security-key-nfc-by-yubico-black/" target="_blank">This is the device</a> that I got from ssl.com three years ago, and which I use with the renewed certificate after completing the <a rel="noopener" href="https://www.ssl.com/how-to/key-generation-and-attestation-with-yubikey/" target="_blank">attestation procedure</a>.</p>
<p>The idea behind this procedure is that the secret key that is used for digital signatures is created inside the dongle, and is never revealed to the computer (or in any other way), so it can&#8217;t be stolen. The dongle generates a certificate (the &#8220;attestation certificate&#8221;) ensuring that the public key and secret key pair was indeed created this way, and is therefore safe. This certificate is signed with Yubico&#8217;s secret key, which is also present inside the dongle.</p>
<p>So the procedure consists of creating the key pair, obtaining the attestation certificate from the dongle and sending it to ssl.com by filling in a web form. They generate a certificate for signing code (or whatever is needed) in response.</p>
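<p>With yubico-piv-tool, the procedure looks roughly like this (the slot and key type below are my assumptions, so check what ssl.com&#8217;s instructions actually ask for; this obviously only runs with the dongle plugged in):</p>
<pre># generate the key pair inside the dongle; the private key never leaves it
yubico-piv-tool -a generate -s 9a -A RSA2048 -o public.pem

# obtain the attestation certificate for that slot, signed inside the dongle
yubico-piv-tool -a attest -s 9a -o attestation.pem</pre>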
<p>So if you&#8217;re about to obtain your first certificate from ssl.com, I suggest checking out the option of purchasing the Yubikey separately from Yubico. They have no reason to refuse on security grounds, because the attestation certificate ensures that the cryptographic key is safe inside the dongle.</p>
<h3>Summary</h3>
<p>Exactly like three years ago, it seems like ssl.com uses fraudulent methods along with dirty tactics to make up for their relatively low prices. So if you want to work with this company, be sure to keep a close eye on your credit card bill, and be ready for lengthy delays when requesting something that apparently goes against their interests. Plus misleading messages.</p>
<p>Also, be ready for a long exchange of emails with their support and billing department. It&#8217;s probably best to escalate to rude and aggressive language pretty soon, as their support team is probably instructed not to be cooperative as long as the person complaining appears to be calm.</p>
<p>And this comes from a company whose core business is generating trust.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2024/05/ssl-com-renewal-fraud/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Notes on ZTE ZXHN F601 GPON ONT</title>
		<link>https://billauer.se/blog/2023/08/zte-f601-gpon-ont/</link>
		<comments>https://billauer.se/blog/2023/08/zte-f601-gpon-ont/#comments</comments>
		<pubDate>Sun, 06 Aug 2023 11:18:07 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Internet]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6938</guid>
		<description><![CDATA[Introduction These are my notes while setting up ZTE&#8217;s ONT for GPON on a Linux desktop computer. I bought this thing from AliExpress at 20 USD, and got a cardboard box with the ONT itself, a power supply and a LAN cable. This is a follow-up from a previous post of mine. I originally got [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>These are my notes while setting up ZTE&#8217;s ONT for GPON on a Linux desktop computer. I bought <a rel="noopener" href="https://www.aliexpress.com/item/1005005697568614.html" target="_blank">this thing</a> from AliExpress at 20 USD, and got a cardboard box with the ONT itself, a power supply and a LAN cable.</p>
<p>This is a follow-up from a <a rel="noopener" href="https://billauer.se/blog/2023/07/fiber-internet-israel-gpon/" target="_blank">previous post of mine</a>. I originally got a Nokia ONT when the fiber was installed, but I wanted an ONT that I can talk with. In particular, one that gives some info about the fiber link. Just in case something happens.</p>
<p>The cable of the 12V/0.5A power supply was too short for me, so I remained with the previous one (from Nokia&#8217;s ONT).</p>
<p>The software version of the ONT is V6.0.1P1T12 out of the box, which is certified by Bezeq. Couldn&#8217;t be better.</p>
<p>By default, this ONT acts as a GPON to Ethernet bridge. However, judging by the menus in its browser interface, it can also act as a router with one Ethernet port: If so requested, it apparently takes care of the PPPoE connection by itself, and is capable of supplying the whole package that comes with a router: NAT, a firewall, a DHCP server, a DNS server and whatnot. I didn&#8217;t try any of this, so I don&#8217;t know how well it works. But it&#8217;s worth keeping these possibilities in mind.</p>
<p>In order to reset the ONT&#8217;s settings to the default values, press the RESET button with a needle for at least five seconds while the device is on (according to the user manual, didn&#8217;t try this).</p>
<p>So how come this thing isn&#8217;t sold at ten times the price, rebranded by some big shot company? I think the reason is this:</p>
<h3>The PON LED is horribly misleading</h3>
<p>According to the user guide, the PON LED is off when the registration has failed, blinking when registration is ongoing, and steadily on when registration is successful.</p>
<p>The problem is that registration doesn&#8217;t mean authentication. In other words, the fact that the PON LED is steadily on doesn&#8217;t mean that the other side (the OLT) is ready to start a PPPoE session. In particular, if the PON serial number is not set up correctly, the PON LED will be steadily on, even though the fiber link provider has rejected the connection.</p>
<p>Nokia&#8217;s modem&#8217;s PON LED will blink when the serial number is wrong, and it makes sense: The PON is not good to go unless the authentication is successful. I suppose most other ONTs behave this way.</p>
<p>The only way to tell is through the browser interface. More about this below.</p>
<h3>Browser interface</h3>
<p>The ONT responds to pings and http at port 80 on address 192.168.1.1. A Chinese login screen appears. Switch language by clicking where it says &#8220;English&#8221; at the login box&#8217;s upper right corner.</p>
<p>The username and password are both &#8220;admin&#8221; by default.</p>
<p>As already mentioned, this ONT has a lot of features. For me, there were two important ones: The ability to change the PON serial number, so I can replace ONTs without involving my ISP, and the ability to monitor the fiber link&#8217;s status and health. This can be crucial when spotting a problem:</p>
<p><a href="https://billauer.se/blog/wp-content/uploads/2023/08/zte-f601-pon-status-page.png"><img class="aligncenter size-medium wp-image-6939" title="Fiber link information on browser interface of ZTE ZXHN F601 GPON ONT" src="https://billauer.se/blog/wp-content/uploads/2023/08/zte-f601-pon-status-page-300x185.png" alt="Fiber link information on browser interface of ZTE ZXHN F601 GPON ONT" width="300" height="185" /></a></p>
<p style="text-align: center;"><a href="https://billauer.se/blog/wp-content/uploads/2023/08/zte-f601-pon-status-page.png" target="_blank"><em>(click to enlarge)</em></a></p>
<p>Note that in this screenshot, the GPON State is &#8220;Authentication Success&#8221;. This is what it should be. If it says &#8220;Registration Complete&#8221;, it means that the ONT managed to get through a few stages of the setup process, but the link isn&#8217;t up yet: The other side probably rejected the serial number (and/or the password, if such is used). And by the way, when the fiber wasn&#8217;t connected at all, it said &#8220;Init State&#8221;.</p>
<p>Also note the input power, around -27 dBm in my case. It depends on a lot of factors, among others the physical distance to the other fiber transmitter. It can also change if optical splitters are added or removed on the way. All this is normal. But each such change indicates that something has happened on the optical link. So it&#8217;s a good way to tell if people are fiddling with the optics, for better and for worse.</p>
<p>These are the changes I made on my box, relative to the default:</p>
<ul>
<li>I turned the firewall off at Security &gt; Firewall (it was at &#8220;Low&#8221;). It&#8217;s actually possible to define custom rules, most likely based upon iptables. I don&#8217;t think the firewall operates when the ONT functions as a bridge, but I turned it off just to be sure it won&#8217;t mess things up.</li>
<li>In Security &gt; Service Control, there&#8217;s an option for telnet access from WAN. Removed it.</li>
<li>In BPDU, disabled BPDU forwarding.</li>
</ul>
<p>I don&#8217;t think any of these changes make any difference when using the ONT as a bridge.</p>
<h3>Setting the PON serial number</h3>
<p><span style="color: #888888; font-style: italic;">Note to self: Look for a file named pon-serial-numbers.txt for the previous and new PON serial numbers.</span></p>
<p>When I first connected the ONT to the fiber, I was surprised to see that the PON LED flashed and then went steady. Say what? The network accepted the ONT&#8217;s default serial number without asking any questions?</p>
<p>I then looked at the &#8220;PON inform&#8221; status page (Status &gt; Network Interface &gt; PON Inform), and it said &#8220;Registration Complete&#8221;. Wow. That looked really reassuring. However, pppd was less happy with the situation. In fact, it had nobody to talk with:</p>
<pre>Aug 06 10:56:21 pppd[36167]: Plugin rp-pppoe.so loaded.
Aug 06 10:56:21 pppd[36167]: RP-PPPoE plugin version 3.8p compiled against pppd 2.4.5
Aug 06 10:56:21 pppd[36168]: pppd 2.4.5 started by root, uid 0
Aug 06 10:56:56 pppd[36168]: <span class="punch">Timeout waiting for PADO packets</span>
Aug 06 10:56:56 pppd[36168]: <span class="punch">Unable to complete PPPoE Discovery</span>
Aug 06 10:56:56 pppd[36168]: Exit.</pre>
<p>Complete silence from the other side. I was being bluntly ignored.</p>
<p>Note that I&#8217;m discussing the PPPoE topic in <a rel="noopener" href="https://billauer.se/blog/2023/07/linux-kernel-pppoe-pppd/" target="_blank">another post of mine</a>.</p>
<p>Solution: I went into the Network &gt; PON &gt; SN menu in the browser interface, and copied the serial number that was printed on my previous ONT in full. It was something like ALCLf8123456. That is, four capital letters, followed by 8 hex digits. There&#8217;s also a place to fill in the password. Bezeq&#8217;s fiber network apparently doesn&#8217;t use a password, so I just wrote &#8220;none&#8221;. Clicked the &#8220;Submit&#8221; button, the ONT rebooted (it takes about a minute), and after that the Internet connection was up and running.</p>
<p>And of course, the GPON State appeared as &#8220;Authentication Success&#8221; in the &#8220;PON Inform&#8221; page.</p>
<p>So don&#8217;t trust the PON LED, and don&#8217;t get deceived by the words &#8220;Registration Complete&#8221;. Unless you feed the serial number that the fiber network provider expects, there&#8217;s nobody talking with you.</p>
<p>In fact, there&#8217;s an option in the browser interface to turn off the LEDs altogether. It seemed like a weird thing to me at first, but maybe this is the Chinese workaround for this issue with the PON LED.</p>
<h3>Bottom line</h3>
<p>With the Internet link up and running, I ran a speed test. Exactly the same as the Nokia ONT.</p>
<p>So the final verdict is that this is a really good ONT, which provides a lot of features and information. The only problem it apparently has is the confusing information regarding the PON link&#8217;s status when the serial number is incorrect. Which is probably the reason why this cute thing remains a Chinese no-name product.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2023/08/zte-f601-gpon-ont/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>http referer info missing in Apache logs for a non-https site</title>
		<link>https://billauer.se/blog/2023/07/http-referrer-missing/</link>
		<comments>https://billauer.se/blog/2023/07/http-referrer-missing/#comments</comments>
		<pubDate>Sun, 30 Jul 2023 07:26:11 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Internet]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6929</guid>
		<description><![CDATA[I checked my Apache access logs, and noted that I saw no indications for people clicking links between two of my websites. It was extremely odd, because it was quite clear that at least a few such clicks should happen. In the beginning, I thought it was because of the rel=&#8221;noopener&#8221; part in the link. [...]]]></description>
			<content:encoded><![CDATA[<p>I checked my Apache access logs, and noticed that there were no indications of people clicking links between two of my websites. It was extremely odd, because it was quite clear that at least a few such clicks should happen.</p>
<p>In the beginning, I thought it was because of the rel=&#8221;noopener&#8221; part in the link. It shouldn&#8217;t have anything to do with this, but maybe it did? So no, that wasn&#8217;t the problem.</p>
<p>The issue was that if the link goes from an https site to a non-https site, the referer is blank. Why? Not 100% clear, but this is what <a href="https://developer.mozilla.org/en-US/docs/Web/Security/Referer_header:_privacy_and_security_concerns" target="_blank">Mozilla&#8217;s guidelines</a> say. It probably has to do with pages with sensitive URLs (e.g. pages for resetting passwords). If the URL leaks through a non-secure http link (say, to a third-party server that supplies images, fonts and other stuff for the page), an eavesdropping attacker might get access to this URL.</p>
<p>And it so happens that this blog is a non-https site as of writing this. Mainly because I&#8217;m lazy.</p>
<p>On the other hand, when you read this, the site has been moved to https. Lazy or not, the missing referrer was the motivation I needed to finally do this.</p>
<p>Was it worth the effort? Well, so-so. Both Chrome and Firefox submit a blank referrer if the link was non-https, even if a redirection to an https address is made. In other words, all existing links to a plain http address will remain hidden. But new links are expected to be based upon https, so at least they will be visible.</p>
<p>Well, partly: My own anecdotal test showed that Firefox indeed submits the full URL of the referrer for an https link, but Chrome gives away only the domain of the linking site. This is more secure, of course: Don&#8217;t disclose a sensitive URL to a third party. And also, if you want to know who links to your page, go to Google&#8217;s Search Console. So chopping off the referrer also serves Google to some extent.</p>
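<p>By the way, what the browser sends is governed by the Referrer-Policy header, so a site that wants its full URLs passed along to same-security destinations can say so explicitly. A sketch for Apache (requires mod_headers; whether relaxing the browser default is a good idea depends on the site):</p>
<pre>Header always set Referrer-Policy "no-referrer-when-downgrade"</pre>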
<p>Bottom line: It seems like the Referer thing is slowly fading away.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2023/07/http-referrer-missing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>PPPoE on fiber with the Linux machine as the router</title>
		<link>https://billauer.se/blog/2023/07/linux-kernel-pppoe-pppd/</link>
		<comments>https://billauer.se/blog/2023/07/linux-kernel-pppoe-pppd/#comments</comments>
		<pubDate>Fri, 14 Jul 2023 13:44:05 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[Linux]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6921</guid>
		<description><![CDATA[Introduction Having switched from ADSL to FTTH (fiber to the home), I was delighted to discover that the same script that set up the pppoe connection for ADSL also works with the new fiber connection. And as the title of this post implies, I opted out the external router that the ISP provided, and instead [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>Having switched from ADSL to FTTH (fiber to the home), I was delighted to discover that the same script that set up the pppoe connection for ADSL also works with the new fiber connection. And as the title of this post implies, I opted out of the external router that the ISP provided, and instead my Linux desktop box does the pppoe stuff. I went for a simple ONT (&#8220;fiber bridge&#8221;) which relays Ethernet packets between the fiber and an Ethernet jack.</p>
<p>This post is a spin-off from <a title="Internet fiber optics in Israel: The gory technical details" href="https://billauer.se/blog/2023/07/fiber-internet-israel-gpon/" target="_blank">another post of mine</a>, which discusses the transition to fiber in general.</p>
<p>Why am I making things difficult, you may ask? Actually, if you&#8217;re reading this there&#8217;s a good chance that you want to do the same, but anyhow: The reason for opting out of an external router is the possibility of running a sniffer on the pppoe negotiation if something goes wrong, so as to be able to tell the difference between rejected credentials and an ISP that doesn&#8217;t talk with me at all. This might hopefully help bring the link back up quicker if and when needed.</p>
<p>But then it turned out that even though the old setting works, the performance is quite bad: It was all nice when the data rate was limited to 15 Mb/s, but 1000 Mb/s is a different story.</p>
<p>So here&#8217;s my own little cookbook to pppoe for FTTH on a Linux desktop.</p>
<h3>The &#8220;before&#8221;</h3>
<p>The commands I used for ADSL were:</p>
<pre>/usr/sbin/pppd pty /usr/local/etc/ADSL-pppoe linkname ADSL-$$ user "<span class="yadayada">myusername@013net</span>" remotename "10.0.0.138 RELAY_PPP1" defaultroute netmask 255.0.0.0 mtu 1452 mru 1452 noauth lcp-echo-interval 60 lcp-echo-failure 3 nobsdcomp usepeerdns</pre>
<p>such that /usr/local/etc/ADSL-pppoe reads:</p>
<pre>#!/bin/bash
/usr/sbin/pppoe -p /var/run/pppoe-adsl.pid.pppoe -I eth1 -T 80 -U  -m 1412</pre>
<p>And of course, replace myusername@013net with your own username and assign the password in /etc/ppp/pap-secrets. Hopefully needless to say, the ADSL modem was connected to eth1.</p>
<p>This ran nicely for years with pppd version 2.4.5 and PPPoE Version 3.10, which are both very old. But never mind the versions of the software. pppoe and pppd are so established that I doubt any significant changes have been made over the last 15 years or so.</p>
<p>Surprisingly enough, I got only 288 Mb/s download and 100 Mb/s upload on <a rel="noopener" href="https://cellcom.co.il/production/Private/Internet/speedtest/" target="_blank">Netvision&#8217;s own speed test</a>. The download speed should have been 1000 Mb/s (and the upload speed is as expected).</p>
<p>I also noted that pppoe ran at 75% CPU during the speed test, which made me suspect that it&#8217;s the bottleneck. Spoiler: It indeed was.</p>
<p>I tried a newer pppd (version 2.4.7) and pppoe (version 3.11) but that made no difference. As one would expect.</p>
<h3>Superfluous options</h3>
<p>Note that pppd gets unnecessary options that set the MTU and MRU to 1452 bytes. I suspected that these were the reason for pppoe working hard, so I tried without them. But there was no difference. They are redundant nevertheless.</p>
<p>Then we have the &#8216;remotename &#8220;10.0.0.138 RELAY_PPP1&#8243; &#8216; part, which, frankly, I don&#8217;t know why is there. Probably a leftover from the ADSL days.</p>
<p>Another thing is pppoe&#8217;s &#8220;-m 1412&#8243; flag, which causes pppoe to mangle TCP packets with the SYN flag set, so that their MSS option is set to 1412 bytes, and not what was originally requested.</p>
<p>A quick reminder: The MSS option is the maximal size of IP packets that we can receive from the TCP stack on the other side. This option is used to tell the other side not to create TCP packets larger than this, in order to avoid fragmentation of arriving packets.</p>
<p>It is actually a good idea to mangle the MSS on outgoing TCP packets, as explained further below. But the 1412 bytes value is archaic, copied from the pppoe man page, or else everyone copies from each other. 1452 is a more sensible figure. But it doesn&#8217;t matter all that much, because I&#8217;m about to scrap the pppoe command altogether. Read on.</p>
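<p>For reference, the 1452 figure follows from simple arithmetic. A quick sketch of where the numbers come from (the constants are the standard Ethernet payload and PPPoE header sizes):</p>

```python
# MTU/MSS arithmetic for a PPPoE link (illustration only)
ETHERNET_MTU = 1500    # payload size of a standard Ethernet frame
PPPOE_OVERHEAD = 8     # 6 bytes PPPoE header + 2 bytes PPP protocol ID

ppp_mtu = ETHERNET_MTU - PPPOE_OVERHEAD  # 1492, as ppp0 reports
ip_tcp_headers = 20 + 20                 # minimal IPv4 + TCP headers
mss = ppp_mtu - ip_tcp_headers           # 1452, the sensible MSS figure

print(ppp_mtu, mss)  # 1492 1452
```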
<h3>Opening the bottleneck</h3>
<p>The solution is simple: Use pppoe in the kernel.</p>
<p>There&#8217;s a whole list of kernel modules that need to be available (or compiled into the kernel), but any sane distribution kernel will have them included. I suppose CONFIG_PPPOE is the kernel option to check out.</p>
<p>The second thing is that pppd should have the rp-pppoe.so plugin available. Once again, I don&#8217;t think you&#8217;ll find a distribution package for pppd without it.</p>
<p>With these at hand, I changed the pppd command to:</p>
<pre>/usr/sbin/pppd <span class="punch">plugin rp-pppoe.so eth1</span> linkname ADSL-$$ user <span class="yadayada">"myusername@013net"</span> defaultroute netmask 255.0.0.0 noauth lcp-echo-interval 60 lcp-echo-failure 3 nobsdcomp usepeerdns</pre>
<p>That&#8217;s exactly the same as above, but instead of the &#8220;pty&#8221; option that calls an external script, I use the plugin to talk with eth1 directly. No pppoe executable to eat CPU, and the transmission speed goes easily up to &gt;900 Mb/s without any dramatic CPU consumption visible (&#8220;top&#8221; reports 8% system CPU at worst, and that&#8217;s global to all CPUs).</p>
<p>I also removed the options for setting MTU and MRU in the pppd command. ppp0 now presents an MTU of 1492, which I suppose is correct. I mean, why fiddle with this? And I ditched the &#8220;remotename&#8221; option too.</p>
<p>Once again, the ONT (&#8220;fiber bridge&#8221;) was connected to eth1.</p>
<h3>Samples of log output</h3>
<p>This is the comparison between pppd&#8217;s output with pppoe as an executable and with the kernel&#8217;s pppoe module:</p>
<p>First, the old way, with pppoe executable:</p>
<pre>Using interface ppp0
<span class="punch">Connect: ppp0 &lt;--&gt; /dev/pts/13</span>
PAP authentication succeeded
local  IP address 109.186.24.16
remote IP address 212.143.8.104
primary   DNS address 194.90.0.1
secondary DNS address 212.143.0.1</pre>
<p>And with pppoe inside the kernel:</p>
<pre>Plugin rp-pppoe.so loaded.
RP-PPPoE plugin version 3.8p compiled against pppd 2.4.5
PPP session is 3865
Connected to 00:1a:f0:87:12:34 via interface eth1
Using interface ppp0
<span class="punch">Connect: ppp0 &lt;--&gt; eth1</span>
PAP authentication succeeded
peer from calling number 00:1A:F0:87:12:34 authorized
local  IP address 109.186.4.18
remote IP address 212.143.8.104
primary   DNS address 194.90.0.1
secondary DNS address 212.143.0.1</pre>
<p>The MAC address that is mentioned seems to be owned by Alcatel-Lucent, and is neither my own host&#8217;s nor the ONT&#8217;s (i.e. the &#8220;fiber adapter&#8221;). It appears to belong to the link partner over the fiber connection.</p>
<p>And by the way, if the ISP credentials are incorrect, the row saying &#8220;Connect X &lt;&#8211;&gt; Y&#8221; is followed by &#8220;LCP: timeout sending Config-Requests&#8221; after about 30 seconds. Instead of &#8220;PAP authentication succeeded&#8221;, of course.</p>
<h3>Clamping MSS</h3>
<p>The pppoe user-space utility had this nice &#8220;-m&#8221; option that caused all TCP packets with a SYN to be mangled, so that their MSS field was set as required for the pppoe link. But now I&#8217;m not using it anymore. How will the MSS field be correct now?</p>
<p>First of all, this is not an issue for packets that are created on the same computer that runs pppd. ppp0&#8217;s MTU is checked by the TCP stack, and the MSS is set correctly to 1452.</p>
<p>But forwarded packets come from a source that doesn&#8217;t know about ppp0&#8217;s reduced MTU. That host sets the MSS according to the NIC that it sees. It can&#8217;t know that this MSS may be too large for the pppoe link in the middle.</p>
<p>The solution is to add a rule in the firewall that mangles these packets:</p>
<pre>iptables -A FORWARD -o ppp0 -t mangle -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu</pre>
<p>This is more or less copied from iptables&#8217; man page. I added the -o part, because this is relevant only for packets going out to ppp0. No point mangling all forwarded packets.</p>
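<p>Conceptually, the TCPMSS target does no more than take a minimum. A hypothetical simulation of the clamping (not the actual netfilter code, of course):</p>

```python
def clamp_mss(mss_option, path_mtu, headers=40):
    """Clamp a SYN packet's MSS option to what the path MTU allows.
    headers is the minimal IPv4 + TCP header size (20 + 20 bytes)."""
    return min(mss_option, path_mtu - headers)

# A LAN host advertises MSS 1460, but the pppoe link's MTU is only 1492:
print(clamp_mss(1460, 1492))  # 1452
```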
<h3>A wireshark dump</h3>
<p>This is what wireshark shows on the Ethernet card that is connected to the ONT during a successful connection to the ISP. It would most likely have looked the same on an ADSL link.</p>
<pre>No.     Time           Source                Destination           Protocol Length Info
      3 0.142500324    Dell_11:22:33         Broadcast             PPPoED   32     Active Discovery Initiation (PADI)
      4 0.144309286    Alcatel-_87:12:34     Dell_11:22:33         PPPoED   60     Active Discovery Offer (PADO) AC-Name='203'
      5 0.144360515    Dell_11:22:33         Alcatel-_87:12:34     PPPoED   52     Active Discovery Request (PADR)
      6 0.146062649    Alcatel-_87:12:34     Dell_11:22:33         PPPoED   60     Active Discovery Session-confirmation (PADS)
      7 0.147037263    Dell_11:22:33         Alcatel-_87:12:34     PPP LCP  36     Configuration Request
      8 0.192272315    Alcatel-_87:12:34     Dell_11:22:33         PPP LCP  60     Configuration Request
      9 0.192290554    Alcatel-_87:12:34     Dell_11:22:33         PPP LCP  60     Configuration Ack
     10 0.192335094    Dell_11:22:33         Alcatel-_87:12:34     PPP LCP  40     Configuration Ack
     11 0.192516908    Dell_11:22:33         Alcatel-_87:12:34     PPP LCP  30     Echo Request
     12 0.192660752    Dell_11:22:33         Alcatel-_87:12:34     PPP PAP  50     Authenticate-Request (Peer-ID='myusername@013net', Password='mypassword')
     13 0.201978697    Alcatel-_87:12:34     Dell_11:22:33         PPP LCP  60     Echo Reply
     14 0.309272346    Alcatel-_87:12:34     Dell_11:22:33         PPP PAP  60     Authenticate-Ack (Message='')
     15 0.309286268    Alcatel-_87:12:34     Dell_11:22:33         PPP IPCP 60     Configuration Request
     16 0.309289064    Alcatel-_87:12:34     Dell_11:22:33         PPP IPV6CP 60     Configuration Request
     17 0.309398416    Dell_11:22:33         Alcatel-_87:12:34     PPP IPCP 44     Configuration Request
     18 0.309429731    Dell_11:22:33         Alcatel-_87:12:34     PPP IPCP 32     Configuration Ack
     19 0.309441755    Dell_11:22:33         Alcatel-_87:12:34     PPP LCP  42     Protocol Reject
     20 0.315313539    Alcatel-_87:12:34     Dell_11:22:33         PPP IPCP 60     Configuration Nak
     21 0.315365821    Dell_11:22:33         Alcatel-_87:12:34     PPP IPCP 44     Configuration Request
     22 0.321070570    Alcatel-_87:12:34     Dell_11:22:33         PPP IPCP 60     Configuration Ack</pre>
<p>These &#8220;Configuration Request&#8221; and &#8220;Configuration Ack&#8221; packets contain a lot of data, of course: This is where the local and remote IP addresses are given, as well as the addresses to the DNSes.</p>
<h3>Some random notes</h3>
<ul>
<li>On a typical LAN connection over Ethernet, MSS is set to 1460. The typical value for a pppoe connection is 1452, 8 bytes lower.</li>
<li>Add &#8220;nodetach&#8221; to pppd&#8217;s command for a (debug) foreground session.</li>
<li>Add &#8220;dump&#8221; to pppd&#8217;s command to see all options in effect (from option file and command line combined).</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2023/07/linux-kernel-pppoe-pppd/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Google Chrome: Stop that nagging on updates</title>
		<link>https://billauer.se/blog/2023/06/google-chrome-disable-update-popup/</link>
		<comments>https://billauer.se/blog/2023/06/google-chrome-disable-update-popup/#comments</comments>
		<pubDate>Sun, 11 Jun 2023 08:53:01 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[stop updates]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6892</guid>
		<description><![CDATA[I have Google Chrome installed on a Linux machine at /opt/google as root, so the browser can&#8217;t update itself automatically. Instead, it complains with this pop-up every time the browser is started: What I really like about this pop-up is the &#8220;you&#8217;re missing out&#8221; part. I get the same thing from the silly image gallery [...]]]></description>
			<content:encoded><![CDATA[<p>I have Google Chrome installed on a Linux machine at /opt/google as root, so the browser can&#8217;t update itself automatically. Instead, it complains with this pop-up every time the browser is started:</p>
<p><a href="https://billauer.se/blog/wp-content/uploads/2023/06/reinstall.jpg"><img class="aligncenter size-full wp-image-6893" title="Annoying Google Chrome update popup" src="https://billauer.se/blog/wp-content/uploads/2023/06/reinstall.jpg" alt="Annoying Google Chrome update popup" width="324" height="180" /></a></p>
<p>What I really like about this pop-up is the &#8220;you&#8217;re missing out&#8221; part. I get the same thing from the silly image gallery app on my Google Pixel phone. This is Google trying to play on my (not so existent) FOMO.</p>
<p>It has been <a rel="noopener" href="https://stackoverflow.com/questions/27962454/disable-chrome-is-out-of-date-notification" target="_blank">suggested</a> to add the &#8211;simulate-outdated-no-au argument to the command line that executes Chrome. This indeed works. The common suggestion, however, is to do that on the shortcut that executes the browser. But that won&#8217;t cover the case when I run the browser from a shell, something I do every now and then. Don&#8217;t ask.</p>
<p>So a more sledgehammer solution is to edit the wrapper script:</p>
<pre>$ which google-chrome
/usr/bin/google-chrome</pre>
<p>So edit this file (as root), and change the last line from</p>
<pre>exec -a "$0" "$HERE/chrome" "$@"</pre>
<p>to</p>
<pre>exec -a "$0" "$HERE/chrome" <span class="punch">--simulate-outdated-no-au='Tue, 31 Dec 2099'</span> "$@"</pre>
<p>What does this mean, then? Well, according to the list of <a rel="noopener" href="https://chromium.googlesource.com/chromium/src/+/HEAD/chrome/common/chrome_switches.cc" target="_blank">Google Chrome switches</a>, this switch &#8220;simulates that current version is outdated and auto-update is off&#8221;. The date is referred to in the source&#8217;s <a rel="noopener" href="https://source.chromium.org/chromium/chromium/src/+/HEAD:chrome/browser/upgrade_detector/upgrade_detector_impl.cc" target="_blank">upgrade_detector_impl.cc</a>. Look there if you want to figure out why this works (I didn&#8217;t bother, actually).</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2023/06/google-chrome-disable-update-popup/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Using git send-email with Gmail + OAUTH2, but without subscribing to cloud services</title>
		<link>https://billauer.se/blog/2022/10/git-send-email-with-oauth2-gmail/</link>
		<comments>https://billauer.se/blog/2022/10/git-send-email-with-oauth2-gmail/#comments</comments>
		<pubDate>Sun, 30 Oct 2022 09:08:44 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[email]]></category>
		<category><![CDATA[Internet]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6761</guid>
		<description><![CDATA[Introduction There is a widespread belief, that in order to use git send-email with Gmail, there&#8217;s a need to subscribe to Google Cloud services and obtain some credentials. Or that a two-factor authentication (2fa) is required. This is not the case, however. If Thunderbird can manage to fetch and send emails through Google&#8217;s mail servers [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>There is a widespread belief that in order to use git send-email with Gmail, there&#8217;s a need to subscribe to Google Cloud services and obtain some credentials. Or that two-factor authentication (2fa) is required.</p>
<p>This is not the case, however. If Thunderbird can manage to fetch and send emails through Google&#8217;s mail servers (as well as other OAUTH2 authenticated mail services), there&#8217;s no reason why a utility won&#8217;t be able to do the same.</p>
<p>The subscription to Google&#8217;s services is indeed required if the communication with Google&#8217;s server must be done without human supervision. That&#8217;s the whole point with API keys. If a human is around when the mail is dispatched, there&#8217;s no need for any special measures. And it&#8217;s quite obvious that there&#8217;s a responsive human around when a patch is being submitted.</p>
<p>What is actually needed, is a client ID and a client secret, and these are indeed obtained by registering to Google&#8217;s cloud service (<a href="https://gitlab.com/fetchmail/fetchmail/-/blob/e92e57cb1ce93b5a09509e65f26bbb5aee5de533/README.OAUTH2" target="_blank">this</a> explains how). But here&#8217;s the thing: Someone at Mozilla has already obtained these, and hardcoded them into Thunderbird itself. So there&#8217;s no problem using these to access Gmail with another mail client. It seems like many believe that the client ID and secret must be related to the mail account to access, and therefore each and every one has to obtain their own pair. That&#8217;s a mistake that has made a lot of people angry for nothing.</p>
<p>This post describes how to use git send-email without any further involvement with Google, except for having a Gmail account. The same method surely applies to other mail service providers that rely on OAUTH2, but I haven&#8217;t gotten into that; it should be quite easy to apply the same idea to other services as well.</p>
<p>For this to work, Thunderbird must be configured to access the same email account. This doesn&#8217;t mean that you actually have to use Thunderbird for your mail exchange. It&#8217;s actually enough to configure the Gmail server as an <strong>outgoing</strong> mail server for the relevant account. In other words, you don&#8217;t even need to fetch mails from the server with Thunderbird.</p>
<p>The point is to make Thunderbird set up the OAUTH2 session, and then fetch the relevant piece of credentials from it. And take it from there with Google&#8217;s servers. Thunderbird is a good candidate for taking care of the session&#8217;s setup, because the whole idea with OAUTH2 is that the user / password session (plus possible additional authentication challenges) is done with a browser. Since Thunderbird is Firefox in disguise, it integrates the browser session well into its general flow.</p>
<p>If you want to use another piece of software to maintain the OAUTH2 session, that&#8217;s most likely possible, given that you can get its refresh token. This will also require obtaining its client ID and client secret. Odds are that they can be found somewhere in that software&#8217;s sources, exactly as I found them for Thunderbird. Or look at the https connection it runs to get an access token (which isn&#8217;t all that easy, what with encryption and all).</p>
<h3>Outline of solution</h3>
<p>All below relates to Linux Mint 19, Thunderbird 91.10.0, git version 2.17.1, Perl 5.26 and msmtp 1.8.14. But except for Thunderbird and msmtp, I don&#8217;t think the versions are going to matter.</p>
<p>It&#8217;s highly recommended to read through my <a rel="noopener" href="https://billauer.se/blog/2022/06/fetchmail-gmail-lsa-oauth2/" target="_blank">blog post on OAUTH2</a>, in particular the section called &#8220;The authentication handshake in a nutshell&#8221;. You&#8217;re going to need to know the difference between an access token and a refresh token sooner or later.</p>
<p>So the first obstacle is the fact that git send-email relies on the system&#8217;s sendmail to send out the emails. That utility doesn&#8217;t support OAUTH2 at the time of writing this. So instead, I used msmtp, which is a drop-in replacement for sendmail, plus it supports OAUTH2 (since version 1.8.13).</p>
<p>msmtp identifies itself to the server by sending it an access token in the SMTP session (see a dump of a sample session below). This access token is short-lived (3600 seconds from Google as of writing this), so it can&#8217;t be fetched from Thunderbird just like that. In particular because most of the time Thunderbird doesn&#8217;t have it.</p>
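<p>For the curious, the blob that appears on the AUTH XOAUTH2 line of the SMTP session is just the access token wrapped in the SASL XOAUTH2 format and base64-encoded. A sketch (the username and token here are of course made up):</p>

```python
import base64

def xoauth2_string(user, access_token):
    # SASL XOAUTH2 initial response:
    # "user=<user>\x01auth=Bearer <token>\x01\x01", then base64-encoded
    raw = f"user={user}\x01auth=Bearer {access_token}\x01\x01"
    return base64.b64encode(raw.encode()).decode()

print(xoauth2_string("mail.username", "ya29.a0FakeToken"))
```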
<p>What Thunderbird does have is a refresh token. It&#8217;s a completely automatic task to ask Google&#8217;s server for the access token with the refresh token at hand. It&#8217;s also an easy task (once you&#8217;ve figured out how to do it, that is). It&#8217;s also easy to get the refresh token from Thunderbird, exactly in the same way as getting a saved password. In fact, Thunderbird treats the refresh token as a password.</p>
<p>msmtp allows executing an arbitrary program in order to get the password or the access token. So I wrote a Perl script (<a rel="noopener" href="https://github.com/billauer/oauth2-helper/blob/main/oauth2-helper.pl" target="_blank">oauth2-helper.pl</a>) that reads the refresh token from a file and gets an access token from Google&#8217;s server. This is how msmtp manages to authenticate itself.</p>
<p>So everything relies on this refresh token. In principle, it can change every time it&#8217;s used. In practice, as of today, Google&#8217;s servers don&#8217;t change it. It seems like the refresh token is automatically replaced every six months, but even if that&#8217;s true today, it may change.</p>
<p>But that doesn&#8217;t matter so much. All that is necessary is that the refresh token is correct once. If the refresh token goes out of sync with Google&#8217;s server, a simple user / password session rectifies this. And as of now, that virtually never happens.</p>
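<p>For the record, exchanging the refresh token for an access token boils down to a single HTTPS POST to Google&#8217;s token endpoint. A minimal sketch of how that request is built (the client ID, secret and refresh token are placeholders; the real values come from Thunderbird&#8217;s sources and ~/.oauth2_reftoken):</p>

```python
from urllib.parse import urlencode
from urllib.request import Request

TOKEN_URL = "https://oauth2.googleapis.com/token"  # Google's token endpoint

def build_refresh_request(client_id, client_secret, refresh_token):
    # grant_type=refresh_token asks the server to mint a fresh access token
    body = urlencode({
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
        "grant_type": "refresh_token",
    }).encode()
    return Request(TOKEN_URL, data=body,
                   headers={"Content-Type": "application/x-www-form-urlencoded"})

req = build_refresh_request("CLIENT_ID", "CLIENT_SECRET", "REFRESH_TOKEN")
print(req.get_method(), req.full_url)
```

<p>The JSON response carries the access_token itself, along with an expires_in field (3600 seconds, as mentioned above), which is all that&#8217;s needed for the caching described below.</p>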
<p>So let&#8217;s get to the hands-on part.</p>
<h3>Install msmtp</h3>
<p>Odds are that your distribution offers msmtp, so it can be installed with something like</p>
<pre># apt install msmtp</pre>
<p>Note however that the version needs to be at least 1.8.13, which wasn&#8217;t my case (Linux Mint 19). So I installed it from the sources. To do that, first install the TLS library, if it&#8217;s not installed already (as root):</p>
<pre># apt install gnutls-dev</pre>
<p>Then clone the git repository, compile and install:</p>
<pre>$ GIT_SSL_NO_VERIFY=true git clone http://git.marlam.de/git/msmtp.git
$ cd msmtp
$ git checkout msmtp-1.8.14
$ autoreconf -i
$ ./configure
$ make &amp;&amp; echo Success
$ sudo make install</pre>
<p>The installation goes to /usr/local/bin and other /usr/local/ paths, as one would expect.</p>
<p>I checked out version 1.8.14 because later versions failed to compile on my Linux Mint 19. OAUTH2 support was added in 1.8.13, and judging by the commit messages it hasn&#8217;t been changed since, except for commit 1f3f4bfd098, which is &#8220;Send XOAUTH2 in two lines, required by Microsoft servers&#8221;. Possibly cherry-pick this commit (I didn&#8217;t).</p>
<p>Once everything has been set up as described below, it&#8217;s possible to send an email with</p>
<pre>$ msmtp -v -t &lt; ~/email.eml</pre>
<p>The -v flag is used only for debugging, and it prints out the entire SMTP session.</p>
<p>The -t flag tells msmtp to fetch the recipients from the mail&#8217;s own headers. Otherwise, the recipients need to be listed in the command line, just like sendmail. Without this flag or recipients, msmtp just replies with</p>
<pre>msmtp: no recipients found</pre>
<p>The -t flag isn&#8217;t necessary with git send-email, because it explicitly lists the recipients in the command line.</p>
<h3>The oauth2-helper.pl script</h3>
<p>As mentioned above, Thunderbird has the refresh token, but msmtp needs an access token. So the script that talks with Google&#8217;s server and grabs the access token can be downloaded from its <a rel="noopener" href="https://github.com/billauer/oauth2-helper/blob/main/oauth2-helper.pl" target="_blank">Github repo</a>. Save it, with execution permission to /usr/local/bin/oauth2-helper.pl (or whatever, but this is what I assume in the configurations below).</p>
<p>Some Perl libraries may be required to run this script. On a Debian-based system, the packages&#8217; names are probably something like libhttp-message-perl, libwww-perl and libjson-perl.</p>
<p>It&#8217;s written to access Google&#8217;s token server, but can be modified easily to access a different service provider by changing the parameters at its beginning. For other email providers, check if it happens to be listed in <a rel="noopener" href="https://github.com/mozilla/releases-comm-central/blob/master/mailnews/base/src/OAuth2Providers.sys.mjs" target="_blank">OAuth2Providers.sys.mjs</a>. I don&#8217;t know how well it will work with those other providers, though.</p>
<p>The script reads the refresh token from ~/.oauth2_reftoken as a plain file containing the blob only. There&#8217;s an inherent security risk in having this token stored like this, but it&#8217;s basically the same risk as the fact that it can be obtained from Thunderbird&#8217;s credential files. The difference is the amount of security by obscurity. Anyhow, the refresh token isn&#8217;t your password, and your password can&#8217;t be derived from it. Either way, make sure that this file has 0600 or 0400 permission, if you&#8217;re running on a multi-user computer.</p>
<p>The script caches the access token in ~/.oauth2_acctoken, with an expiration timestamp. As of today, this means that the script talks with Google&#8217;s server at most once every 60 minutes.</p>
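<p>The caching logic itself is trivial. A sketch of the idea (the cache file name and the fetch_access_token helper are hypothetical, not what the actual script uses):</p>

```python
import json, os, time

def get_cached_token(fetch_access_token, cache_file, now=None):
    """Return a cached access token if it hasn't expired; otherwise
    fetch a fresh one and cache it along with its expiry timestamp."""
    now = time.time() if now is None else now
    try:
        with open(cache_file) as f:
            cached = json.load(f)
        if cached["expires_at"] > now:
            return cached["token"]
    except (OSError, ValueError, KeyError):
        pass  # no usable cache; fall through and fetch a new token
    token, lifetime = fetch_access_token()  # e.g. ("ya29....", 3600)
    with open(cache_file, "w") as f:
        json.dump({"token": token, "expires_at": now + lifetime}, f)
    return token
```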
<h3>Setting up config files</h3>
<p>So with msmtp installed and the script downloaded into /usr/local/bin/oauth2-helper.pl, all that is left is configuration files.</p>
<p>First, create ~/.msmtprc as follows (put your Gmail username instead of mail.username, of course):</p>
<pre>account default
host smtp.gmail.com
port 587
tls on
tls_starttls on
auth xoauth2
user mail.username
passwordeval /usr/local/bin/oauth2-helper.pl
from mail.username@gmail.com</pre>
<p>And then change the [sendemail] section in ~/.gitconfig to</p>
<pre>[sendemail]
        smtpServer = /usr/local/bin/msmtp</pre>
<p>That&#8217;s it. Only that single line. It&#8217;s however possible to use smtpServerOption in the .gitconfig to add various flags. So for example, to get the entire SMTP session shown while sending the email, it should say:</p>
<pre>[sendemail]
        smtpServer = /usr/local/bin/msmtp
        smtpServerOption = <span class="punch">-v
</span></pre>
<p>But really, don&#8217;t, unless there&#8217;s a problem sending mails.</p>
<p>Other than that, don&#8217;t keep old settings. For example, there should <strong>not</strong> be a &#8220;from=&#8221; entry in .gitconfig. Having one causes a &#8220;From:&#8221; header to be added into the mail body (so it&#8217;s visible to the reader of the mail). This header is created when there is a difference between the &#8220;From&#8221; that is generated by git send-email (which is taken from the &#8220;from=&#8221; entry) and the patch&#8217;s author, as it appears in the patch&#8217;s &#8220;From&#8221; header. The purpose of this in-body header is to tell &#8220;git am&#8221; who the real author is (i.e. not the sender of the patch). So this extra header won&#8217;t appear in the commit, but it nevertheless makes the sender of the message look somewhat clueless.</p>
<p>So in short, no old junk.</p>
<h3>Sending a patch</h3>
<p>Unless it&#8217;s the first time, I suggest just trying to send the patch to your own email address, and see if it works. There&#8217;s a good chance that the refresh token from the previous time will still be good, so it will just work, and no point hassling more.</p>
<p>Actually, it&#8217;s fine to try like this even on the first time, because the Perl script will fail to grab the access token and then tell you what to do to fix it, namely:</p>
<ul>
<li>Make sure that Thunderbird has access to the mail account itself, possibly by attempting to send an email through Gmail&#8217;s server.</li>
<li>Go to Thunderbird&#8217;s Preferences &gt; Privacy &amp; Security and click on Saved Passwords. Look for the account, where the Provider start with oauth://. Right-click that line and choose &#8220;Copy Password&#8221;.</li>
<li>Create or open ~/.oauth2_reftoken, and paste the blob into that file, so it contains only that string. No need to be uptight with newlines and whitespaces: They are ignored.</li>
</ul>
<p>And then go, as usual:</p>
<pre>$ git send-email --to 'my@test.mail' 0001-my.patch</pre>
<p>I&#8217;ve added the output of a successful session (with the -v flag) below.</p>
<h3>Room for improvements</h3>
<p>It would have been nicer to fetch the refresh token automatically from Thunderbird&#8217;s credentials store (that is from logins.json, based upon the decryption key that is kept in key4.db), but the available scripts for that are written in Python. And to me Python is equal to &#8220;will cause trouble sooner or later&#8221;. Anyhow, <a rel="noopener" href="https://apr4h.github.io/2019-12-20-Harvesting-Browser-Credentials/" target="_blank">this tutorial</a> describes the mechanism (in the part about Firefox).</p>
<p>Besides, it could have been even nicer if the script was completely standalone, and didn&#8217;t depend on Thunderbird at all. That requires doing the whole dance with the browser, something I have no motivation to get into.</p>
<h3>A successful session</h3>
<p>This is what it looks like when a patch is properly sent, with the smtpServerOption = -v line in .gitconfig (so msmtp produces verbose output):</p>
<pre><span class="yadayada">Send this email? ([y]es|[n]o|[q]uit|[a]ll): y</span>
ignoring system configuration file /usr/local/etc/msmtprc: No such file or directory
loaded user configuration file /home/eli/.msmtprc
falling back to default account
Fetching access token based upon refresh token in /home/eli/.oauth2_reftoken...
using account default from /home/eli/.msmtprc
host = smtp.gmail.com
port = 587
source ip = (not set)
proxy host = (not set)
proxy port = 0
socket = (not set)
timeout = off
protocol = smtp
domain = localhost
auth = XOAUTH2
user = mail.username
password = *
passwordeval = /usr/local/bin/oauth2-helper.pl
ntlmdomain = (not set)
tls = on
tls_starttls = on
tls_trust_file = system
tls_crl_file = (not set)
tls_fingerprint = (not set)
tls_key_file = (not set)
tls_cert_file = (not set)
tls_certcheck = on
tls_min_dh_prime_bits = (not set)
tls_priorities = (not set)
tls_host_override = (not set)
auto_from = off
maildomain = (not set)
from = mail.username@gmail.com
set_from_header = auto
set_date_header = auto
remove_bcc_headers = on
undisclosed_recipients = off
dsn_notify = (not set)
dsn_return = (not set)
logfile = (not set)
logfile_time_format = (not set)
syslog = (not set)
aliases = (not set)
reading recipients from the command line
&lt;-- 220 smtp.gmail.com ESMTP m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
--&gt; EHLO localhost
&lt;-- 250-smtp.gmail.com at your service, [109.186.183.118]
&lt;-- 250-SIZE 35882577
&lt;-- 250-8BITMIME
&lt;-- 250-STARTTLS
&lt;-- 250-ENHANCEDSTATUSCODES
&lt;-- 250-PIPELINING
&lt;-- 250-CHUNKING
&lt;-- 250 SMTPUTF8
--&gt; STARTTLS
&lt;-- 220 2.0.0 Ready to start TLS
TLS session parameters:
    (TLS1.2)-(ECDHE-ECDSA-SECP256R1)-(CHACHA20-POLY1305)
TLS certificate information:
    Subject:
        CN=smtp.gmail.com
    Issuer:
        C=US,O=Google Trust Services LLC,CN=GTS CA 1C3
    Validity:
        Activation time: Mon 26 Sep 2022 11:22:04 AM IDT
        Expiration time: Mon 19 Dec 2022 10:22:03 AM IST
    Fingerprints:
        SHA256: 53:F3:CA:1D:37:F2:1F:ED:2C:67:40:A2:A2:29:C2:C8:E8:AF:9E:60:7A:01:92:EC:F0:2A:11:E8:37:A5:88:F3
        SHA1 (deprecated): D4:69:6E:59:2D:75:43:59:02:74:25:67:E7:57:40:E0:28:43:A8:62
--&gt; EHLO localhost
&lt;-- 250-smtp.gmail.com at your service, [109.186.183.118]
&lt;-- 250-SIZE 35882577
&lt;-- 250-8BITMIME
&lt;-- 250-AUTH LOGIN PLAIN XOAUTH2 PLAIN-CLIENTTOKEN OAUTHBEARER XOAUTH
&lt;-- 250-ENHANCEDSTATUSCODES
&lt;-- 250-PIPELINING
&lt;-- 250-CHUNKING
&lt;-- 250 SMTPUTF8
--&gt; AUTH XOAUTH2 dXNlcj1lbGkuYmlsbGF1ZXIBYXV0aD1CZWFyZXIgeWEyOS5hMEFhNHhyWE1GM1gtOTJMVWNidjE4MFdVOBROENRcUdSbk5KaUFSY0VSckVaXzdzbDlHMTNpdFIyUTk0NjlKWG45aHVGLQVRBU0FSTVXJpSjRqMjBLcWh6WU9GekxlcU5BYVpFNUU4WXRhNjdLUXpCRm1HRDg3dFgzeHJ4amNPTnRVTkZFVWdESXhsUlcxOFhVT0pqQ1hPSlFwZlNGUUVqRHZMOWw4RExkTjlKZlNbGRTazNNbFNMNjVfQWFDZ1lLVVF2Y0luOWNSSUEwMTY2AQE=
&lt;-- 235 2.7.0 Accepted
--&gt; MAIL FROM:&lt;mail.username@gmail.com&gt;
--&gt; RCPT TO:&lt;test@mail.com&gt;
--&gt; RCPT TO:&lt;mail.username@gmail.com&gt;
--&gt; DATA
&lt;-- 250 2.1.0 OK m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
&lt;-- 250 2.1.5 OK m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
&lt;-- 250 2.1.5 OK m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
&lt;-- 354  Go ahead m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
--&gt; From: Eli Billauer &lt;mail.username@gmail.com&gt;
--&gt; To: test@mail.com
--&gt; Cc: Eli Billauer &lt;mail.username@gmail.com&gt;
--&gt; Subject: [PATCH v8] Gosh! Why don't you apply this patch already!
--&gt; Date: Sun, 30 Oct 2022 07:01:14 +0200
--&gt; Message-Id: &lt;20221030050114.49299-1-mail.username@gmail.com&gt;
--&gt; X-Mailer: git-send-email 2.17.1
--&gt; 

<span class="yadayada">[ ... email body comes here ... ]</span>

--&gt; --
--&gt; 2.17.1
--&gt;
--&gt; .
&lt;-- 250 2.0.0 OK  1667106108 m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
--&gt; QUIT
&lt;-- 221 2.0.0 closing connection m8-20020a7bcb88000000b003c6d21a19a0sm3316430wmi.29 - gsmtp
OK. Log says:
Sendmail: /usr/local/bin/msmtp -v -i test@mail.com mail.username@gmail.com
From: Eli Billauer &lt;mail.username@gmail.com&gt;
To: test@mail.com
Cc: Eli Billauer &lt;mail.username@gmail.com&gt;
Subject: [PATCH v8] Gosh! Why don't you apply this patch already!
Date: Sun, 30 Oct 2022 07:01:14 +0200
Message-Id: &lt;20221030050114.49299-1-mail.username@gmail.com&gt;
X-Mailer: git-send-email 2.17.1

Result: OK</pre>
<p>Ah, and the fact that the access token can be copied from here is of course meaningless, as it has expired long ago.</p>
<h3>Thunderbird debug notes</h3>
<p>These are some random notes I made while digging in Thunderbird&#8217;s guts to find out what&#8217;s going on.</p>
<p>So this is Thunderbird&#8217;s official <a rel="noopener" href="https://github.com/mozilla/releases-comm-central" target="_blank">git repo</a>. Not that I used it.</p>
<p>To get logging info from Thunderbird: Based upon <a rel="noopener" href="https://wiki.mozilla.org/MailNews:Logging#Setting_Thunderbird_Preference" target="_blank">this page</a>, go to Thunderbird&#8217;s preferences &gt; General and click the Config Editor button. Set mailnews.oauth.loglevel to All (was Warn). Same with mailnews.smtp.loglevel. Then open the Error Console with Ctrl+Shift+J.</p>
<p>The cute thing about these logs is that the access token is written in the log. So it&#8217;s possible to skip the Perl script, and use the access token from Thunderbird&#8217;s log. Really inconvenient, but possible.</p>
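<p>For the record, if one goes down that path, the helper script can be bypassed by pointing msmtp at a file holding the token that was copied from the log. A sketch (the file name is made up, and the token expires quickly, so this is strictly a stopgap):</p>

```
# Hypothetical ~/.msmtprc fragment: instead of passwordeval running the
# Perl helper, read an access token that was pasted from Thunderbird's log
passwordeval = cat /home/eli/.access_token
```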
<p>The OAuth2 token request is implemented in <a rel="noopener" href="https://github.com/mozilla/releases-comm-central/blob/master/mailnews/base/src/OAuth2.jsm" target="_blank">Oauth2.jsm</a>. It&#8217;s possible to set a breakpoint in this module through Tools &gt; Developer Tools &gt; Developer Toolbox, and once it opens (after requesting permission for an external connection), go to the debugger.</p>
<p>Find Oauth2.jsm in the sources pane to the left (of the Debugger tab), under resource:// modules &gt; sessionstore. Add a breakpoint in requestAccessToken() so that the clientID and consumerSecret properties can be revealed.</p>
<h3><span style="color: #888888;">Sending a patch from Thunderbird directly</span></h3>
<p><span style="color: #888888;">This is a really bad idea. But if you have Thunderbird, and need to send a patch right now, this is a quick, dirty and somewhat dangerous procedure for doing that.</span></p>
<p><span style="color: #888888;">Why is it dangerous? Because at some point, it&#8217;s easy to pick &#8220;Send now&#8221; instead of &#8220;Send later&#8221;, and boom, a junk patch is mailed to the whole world.</span></p>
<p><span style="color: #888888;">The problem with Thunderbird is that it makes small changes to the patch&#8217;s body. So to work around this, there&#8217;s a really silly procedure. I used it once, and I&#8217;m not proud of it.</span></p>
<p><span style="color: #888888;">So here we go.</span></p>
<p><span style="color: #888888;">First, a very simple script that outputs the patch mail into a file. Say that I called it dumpit (should be executable, of course):</span></p>
<pre><span style="color: #888888;">#!/bin/bash

cat &gt; /home/eli/Desktop/git-send-email.eml
</span></pre>
<p><span style="color: #888888;">Then change ~/.gitconfig, so it reads something like this in the [sendemail] section:</span></p>
<pre><span style="color: #888888;">[sendemail]
        from = mail.username@gmail.com
        smtpServer = /home/eli/Desktop/dumpit
</span></pre>
<p><span style="color: #888888;">So basically it uses the silly script as a mail server, and the content goes out to a plain file.</span></p>
<p><span style="color: #888888;">Then run git send-email as usual. The result is the mail, written to a plain file named git-send-email.eml.</span></p>
<p><span style="color: #888888;">And now comes the part of making Thunderbird send it.</span></p>
<ul>
<li><span style="color: #888888;">Close Thunderbird. All windows.</span></li>
<li><span style="color: #888888;">Change directory to where Thunderbird keeps its profile files, into Mail/Local Folders</span></li>
<li><span style="color: #888888;">Remove &#8220;Unsent Messages&#8221; and &#8220;Unsent Messages.msf&#8221;</span></li>
<li><span style="color: #888888;">Open Thunderbird again</span></li>
<li><span style="color: #888888;">Inside Thunderbird, go to Hamburger Icon &gt; File &gt; Open &gt; Saved Message&#8230; and select git-send-email.eml. The email message should appear.</span></li>
<li><span style="color: #888888;">Right-Click somewhere in the message&#8217;s body, and pick Edit as New Message&#8230;</span></li>
<li><span style="color: #888888;"><strong>Don&#8217;t send this message as is</strong>! It&#8217;s completely messed up. In particular, there are some indentations in the patch itself, which renders it useless.</span></li>
<li><span style="color: #888888;">Instead, pick File &gt; Send Later.</span></li>
<li><span style="color: #888888;">Once again, close Thunderbird. All windows.</span></li>
<li><span style="color: #888888;">Remove &#8220;Unsent Messages.msf&#8221; (only)</span></li>
<li><span style="color: #888888;">Edit &#8220;Unsent Messages&#8221; as follows: Everything under the &#8220;Content-Transfer-Encoding: 7bit&#8221; part is the mail&#8217;s body. So remove the &#8220;From:&#8221; line after it, and paste the email&#8217;s body from git-send-email.eml instead.</span></li>
<li><span style="color: #888888;">Note that there are normally two blank lines after the mail&#8217;s body. Retain them.</span></li>
<li><span style="color: #888888;">Open Thunderbird again. Verify that those indentations are gone.</span></li>
<li><span style="color: #888888;">Look at the mail inside Outbox, and verify that it&#8217;s OK now. These are the three things to look for in particular:</span>
<ul>
<li><span style="color: #888888;">The &#8220;From:&#8221; part at the beginning of the message is gone.</span></li>
<li><span style="color: #888888;">At the end of the message, there&#8217;s a &#8220;--&#8221; and git&#8217;s version number. These should be on <strong>separate lines</strong>.</span></li>
<li><span style="color: #888888;">Look at the mail&#8217;s source. The &#8220;+&#8221; and &#8220;-&#8221; signs of the diffs must not be indented.</span></li>
</ul>
</li>
<li><span style="color: #888888;">If all is fine, right-click Outbox, and pick &#8220;Send unsent messages&#8221;. And hope for the best.</span></li>
</ul>
<p><span style="color: #888888;">Are you sure you want to do this?</span></p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2022/10/git-send-email-with-oauth2-gmail/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Blocking bots by their IP addresses, the DIY version</title>
		<link>https://billauer.se/blog/2022/08/spiders-bots-denial-iptables-ipset/</link>
		<comments>https://billauer.se/blog/2022/08/spiders-bots-denial-iptables-ipset/#comments</comments>
		<pubDate>Tue, 16 Aug 2022 10:26:37 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[Server admin]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6670</guid>
		<description><![CDATA[Introduction I had some really annoying bots on one of my websites. Of the sort that make a million requests (like really, a million) per month, identifying themselves as a browser. So IP blocking it is. I went for a minimalistic DIY approach. There are plenty of tools out there, but my experience with things [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>I had some really annoying bots on one of my websites. Of the sort that make a million requests (like really, a million) per month, identifying themselves as a browser.</p>
<p>So IP blocking it is. I went for a minimalistic DIY approach. There are plenty of tools out there, but my experience with things like this is that in the end, it&#8217;s me and the scripts. So I might as well write them myself.</p>
<h3>The IP set feature</h3>
<p>Iptables has an IP set module, which allows feeding it with a set of random IP addresses. Internally, it creates a hash with these addresses, so it&#8217;s an efficient way to keep track of multiple addresses.</p>
<p>IP sets have been in the kernel for ages, but the feature has to be enabled in the kernel with CONFIG_IP_SET. Which it most likely is.</p>
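<p>A quick way to verify this against the running kernel (assuming the usual location of the kernel config file):</p>

```
grep CONFIG_IP_SET= /boot/config-$(uname -r)
```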
<p>The ipset utility may need to be installed, with something like</p>
<pre># apt install ipset</pre>
<p>There seems to be a protocol mismatch issue with the kernel, which apparently is a non-issue. But every time something goes wrong with ipset, there&#8217;s a warning message about this mismatch, which is misleading. It looks something like this:</p>
<pre># ipset <span class="yadayada">[ ... something stupid or malformed ... ]</span>
ipset v6.23: Kernel support protocol versions 6-7 while userspace supports protocol versions 6-6
<span class="yadayada">[ ... some error message related to the stupidity ... ]</span></pre>
<p>So the important thing to be aware of is that odds are that the problem isn&#8217;t the version mismatch, but sits between the chair and the keyboard.</p>
<h3>Hello, world</h3>
<p>A quick session</p>
<pre># ipset create testset hash:ip
# ipset add testset 1.2.3.4
# iptables -I INPUT -m set --match-set testset src -j DROP
# ipset del testset 1.2.3.4</pre>
<p>Attempting to add an IP address that is already in the list causes a warning, and the address isn&#8217;t added. So there&#8217;s no need to check whether the address is already there. Besides, there&#8217;s the -exist option, which is really great.</p>
<p>List the members of the IP set:</p>
<pre># ipset -L</pre>
<h3>Timeout</h3>
<p>An entry can have a timeout feature, which works exactly as one would expect: The rule vanishes after the timeout expires. The timeout entry in ipset -L counts down.</p>
<p>For this to work, the set must be created with a default timeout attribute. Zero means that timeout is disabled (which I chose as a default in this example).</p>
<pre># ipset create testset hash:ip timeout 0
# ipset add testset 1.2.3.4 timeout 10</pre>
<p>The &#8216;-exist&#8217; flag causes ipset to re-add an existing entry, which also resets its timeout. So this is the way to keep the list fresh.</p>
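<p>So refreshing an entry for another stretch of time is just the same add command with -exist, for example:</p>

```
# ipset add -exist testset 1.2.3.4 timeout 86400
```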
<h3>Don&#8217;t put the DROP rule first</h3>
<p>It&#8217;s tempting to put the DROP rule with --match-set first, because hey, let&#8217;s give those intruders the boot right away. But doing that, there might be lingering TCP connections, because the last FIN packet is caught by the firewall as the new rule is added. Given that adding an IP address is the result of a flood of requests, this is a realistic scenario.</p>
<p>The solution is simple: There&#8217;s most likely a &#8220;state RELATED,ESTABLISHED&#8221; rule somewhere in the list. So push it to the top. The rationale is simple: If a connection has begun, don&#8217;t chop it in the middle in any case. It&#8217;s the first packet that we want killed.</p>
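<p>In iptables terms, that ordering looks something like this (a sketch against a plain INPUT chain; adapt to your own ruleset):</p>

```
# Established connections are accepted first, so an already-open
# connection isn't chopped when its peer address gets blacklisted:
iptables -I INPUT 1 -m state --state RELATED,ESTABLISHED -j ACCEPT
# The set-based DROP goes after it, so only new connections are caught:
iptables -A INPUT -m set --match-set mysiteset src -j DROP
```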
<h3>Persistence</h3>
<p>The rule in iptables must refer to an existing set. So if the rule that relies on the set is part of the persistent firewall rules, it must be created before the script that brings up iptables runs.</p>
<p>This is easily done by adding a rule file like this as /usr/share/netfilter-persistent/plugins.d/10-ipset</p>
<pre><span class="hljs-meta">#!/bin/sh</span>

IPSET=/sbin/ipset
SET=mysiteset

<span class="hljs-keyword">case</span> <span class="hljs-string">"<span class="hljs-variable">$1</span>"</span> <span class="hljs-keyword">in</span>
start|restart|reload|force-reload)
	<span class="hljs-variable">$IPSET</span> destroy
	<span class="hljs-variable">$IPSET</span> create <span class="hljs-variable">$SET</span> <span class="hljs-built_in">hash</span>:ip <span class="hljs-built_in">timeout</span> 0
	;;

save)
	<span class="hljs-built_in">echo</span> <span class="hljs-string">"ipset-persistent: The save option does nothing"</span>
	;;

stop|flush)
	<span class="hljs-variable">$IPSET</span> flush <span class="hljs-variable">$SET</span>
	;;
*)
    <span class="hljs-built_in">echo</span> <span class="hljs-string">"Usage: <span class="hljs-variable">$0</span> {start|restart|reload|force-reload|save|stop|flush}"</span> &gt;&amp;2
    <span class="hljs-built_in">exit</span> 1
    ;;
<span class="hljs-keyword">esac</span>

<span class="hljs-built_in">exit</span> 0</pre>
<p>The idea is that the index 10 in the file&#8217;s name is smaller than the rule that sets up iptables, so it runs first.</p>
<p>This script is a dirty hack, but hey, it works. There&#8217;s a <a rel="noopener" href="https://sourceforge.net/p/ipset-persistent/wiki/Home/" target="_blank">small project</a> on this, for those who like to do it properly.</p>
<p>The operating system in question is systemd-based, but this old school style is still in effect.</p>
<h3>Maybe block by country?</h3>
<p>Since all offending requests came from the same country (cough, cough, China, from more than 4000 different IP addresses), I&#8217;m considering blocking it in one go. A list of the 4000+ IP addresses that I busted in August 2022 with aggressive bots (all from China) can be downloaded as a simple <a href="/download/china-bots-ip-addresses.txt.gz" target="_blank">compressed text file</a>.</p>
<p>So the idea is going something like</p>
<pre>ipset create foo hash:net
ipset add foo 192.168.0.0/24
ipset add foo 10.1.0.0/16
ipset add foo 172.16.0.0/12</pre>
<p>and download the per-country IP ranges from <a href="https://www.ipdeny.com/ipblocks/" target="_blank">IP deny</a>. That&#8217;s a simple and crude tool for denial by geolocation. The only thing that puts me off a bit is that it&#8217;s &gt; 7000 entries, so I wonder whether that puts a load on the server. But what really counts is the number of different subnet mask sizes, because each mask size has its own hash. So if the list covers all possible sizes, from a full /32 down to, say, /16, there are 17 hashes to look up for each arriving packet.</p>
<p>On the other hand, since the rule should be after the &#8220;state RELATED,ESTABLISHED&#8221; rule, it only covers SYN packets. And if this whole thing is put as late as possible in the list of rules, it boils down to handling only packets that are intended for the web server&#8217;s ports, or those that are going to be dropped anyhow. So compared with the CPU cycles of handling the http request, even 17 hashes isn&#8217;t all that much.</p>
<p>The biggest caveat, however, is if other websites are colocated on the server. It&#8217;s one thing to block offending IPs, but blocking a whole country from all sites is a bit too much.</p>
<p><em>Note to self: In the end, I wrote a little Perl-XS module that says if the IP belongs to a group. Look for byip.pm.</em></p>
<h3>The blacklisting script</h3>
<p>The Perl script that performs the blacklisting is crude and inaccurate, but simple. This is the part to tweak and play with, and in particular adapt to each specific website. It&#8217;s all about detecting abnormal access.</p>
<p>Truth be told, I replaced this script with a more sophisticated mechanism pretty much right away on my own system. But what&#8217;s really interesting is the calls to ipset.</p>
<p>This script reads through Apache&#8217;s access log file, and analyzes each minute in time (as in 60 seconds). In other words, all accesses that have the same timestamp, with the seconds part ignored. Note that the regex that captures $time in the script ignores the trailing :\d\d (the seconds).</p>
<p>If the same IP address appears more than 50 times, that address is blacklisted, with a timeout of 86400 seconds (24 hours). Log entries that correspond to page requisites and such (images, style files etc.) are skipped for this purpose. Otherwise, it&#8217;s easy to reach 50 accesses within a minute with legit web browsing.</p>
<p>There are several imperfections about this script, among others:</p>
<ul>
<li>Since it reads through the entire log file each time, it keeps relisting each IP address until the access log is rotated away, and a new one is started. This causes an update of the timeout, so effectively the blacklisting lasts for up to 48 hours.</li>
<li>Looking in segments of accesses that happen to have the same minute in the timestamp is quite inaccurate regarding which IPs are caught and which aren&#8217;t.</li>
</ul>
<p>The script goes as follows:</p>
<pre><span class="hljs-comment">#!/usr/bin/perl</span>
<span class="hljs-keyword">use</span> warnings;
<span class="hljs-keyword">use</span> strict;

<span class="hljs-keyword">my</span> $logfile = <span class="hljs-string">'/var/log/mysite.com/access.log'</span>;
<span class="hljs-keyword">my</span> $limit = <span class="hljs-number">50</span>; <span class="hljs-comment"># 50 accesses per minute</span>
<span class="hljs-keyword">my</span> $timeout = <span class="hljs-number">86400</span>;

<span class="hljs-keyword">open</span>(<span class="hljs-keyword">my</span> $in, <span class="hljs-string">"&lt;"</span>, $logfile)
  <span class="hljs-keyword">or</span> <span class="hljs-keyword">die</span> <span class="hljs-string">"Can't open $logfile for read: $!\n"</span>;

<span class="hljs-keyword">my</span> $current = <span class="hljs-string">''</span>;
<span class="hljs-keyword">my</span> $l;
<span class="hljs-keyword">my</span> %h;
<span class="hljs-keyword">my</span> %blacklist;

<span class="hljs-keyword">while</span> (<span class="hljs-keyword">defined</span> ($l = &lt;$in&gt;)) {
  <span class="hljs-keyword">my</span> ($ip, $time, $req) = ($l =~ <span class="hljs-regexp">/^([^ ]+).*?\[(.+?):\d\d[ ].*?\"\w+[ ]+([^\"]+)/</span>);
  <span class="hljs-keyword">unless</span> (<span class="hljs-keyword">defined</span> $ip) {
    <span class="hljs-comment">#    warn("Failed to parse line $l\n");</span>
    <span class="hljs-keyword">next</span>;
  }

  <span class="hljs-keyword">next</span>
    <span class="hljs-keyword">if</span> ($req =~ <span class="hljs-regexp">/^\/(?:media\/|robots\.txt)/</span>);

  <span class="hljs-keyword">unless</span> ($time eq $current) {
    <span class="hljs-keyword">foreach</span> <span class="hljs-keyword">my</span> $k (<span class="hljs-keyword">sort</span> <span class="hljs-keyword">keys</span> %h) {
      $blacklist{$k} = <span class="hljs-number">1</span>
	<span class="hljs-keyword">if</span> ($h{$k} &gt;= $limit);
    }

    %h = ();
    $current = $time;
  }
  $h{$ip}++;
}

<span class="hljs-keyword">close</span> $in;

<span class="hljs-keyword">foreach</span> <span class="hljs-keyword">my</span> $k (<span class="hljs-keyword">sort</span> <span class="hljs-keyword">keys</span> %blacklist) {
  <span class="hljs-keyword">system</span>(<span class="hljs-string">'/sbin/ipset'</span>, <span class="hljs-string">'add'</span>, <span class="hljs-string">'-exist'</span>, <span class="hljs-string">'mysiteset'</span>, $k, <span class="hljs-string">'timeout'</span>, $timeout);
}</pre>
<p>It has to be run as root, of course. Most likely as a cronjob.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2022/08/spiders-bots-denial-iptables-ipset/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google Translate, LaTeX and asian languages: Technical notes</title>
		<link>https://billauer.se/blog/2022/08/google-translate-pdflatex-technical/</link>
		<comments>https://billauer.se/blog/2022/08/google-translate-pdflatex-technical/#comments</comments>
		<pubDate>Mon, 15 Aug 2022 07:18:50 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6665</guid>
		<description><![CDATA[Introduction These post contains a few technical notes of using Google Translate for translating LaTeX documents into Chinese, Japanese and Korean. The insights on the language-related issues are written down in a separate post. Text vs. HTML Google&#8217;s cloud translator can be fed with either plain text or HTML, and it returns the same format. [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>This post contains a few technical notes on using Google Translate for translating LaTeX documents into Chinese, Japanese and Korean. The insights on the language-related issues are written down in a <a title="Translating technical documentation with Google Translate" href="https://billauer.se/blog/2022/08/google-translate-insights/" target="_blank">separate post</a>.</p>
<h3>Text vs. HTML</h3>
<p>Google&#8217;s cloud translator can be fed with either plain text or HTML, and it returns the same format. Plain text format is out of the question for anything but translating short sentences, as it becomes impossible to maintain the text&#8217;s formatting. So I went for the HTML interface.</p>
<p>The thing with HTML is that whitespace can take different forms and shapes, and it is redundant in many situations. For example, a newline is often equivalent to a plain space, and neither makes any difference between two paragraphs that are enclosed in &lt;p&gt; tags.</p>
<p>Google Translate takes this notion to the extreme, and typically removes all newlines from the original text. OK, that&#8217;s understandable. But it also adds and removes whitespace where it has no business doing anything, in particular around meaningless segments that aren&#8217;t translated anyhow. This makes it quite challenging to feed the results into further automatic processing.</p>
<h3>Setting up a Google Cloud account</h3>
<p>When creating a new Google Cloud account, there&#8217;s an automatic credit of $300 to spend over three months. So there&#8217;s plenty of room for much needed experimenting. To see the status of the evaluation period, go to Billing &gt; Cost Breakdown and wait a minute or so for the &#8220;Free trial status&#8221; strip to appear at the top of the page. There&#8217;s no problem with &#8220;activating full account&#8221; immediately. The free trial credits remain, but it also means that real billing begins when the credits are consumed and/or the trial period is over.</p>
<p>First, create a new Google Cloud account and enable the Google Translate API.</p>
<p>I went for Basic v2 translation (and not Advanced, v3). Their pricing is the same, but v3 is not allowed with an API key, and I really wasn&#8217;t into setting up a service account and struggle with OAuth2. The main advantage with v3 is the possibility to train the machine to adapt to a specific language pattern, but as mentioned in <a title="Translating technical documentation with Google Translate" href="https://billauer.se/blog/2022/08/google-translate-insights/" target="_blank">that separate post</a>, I&#8217;m hiding away anything but common English language patterns.</p>
<p>As for authentication, I went for <a rel="noopener" href="https://cloud.google.com/docs/authentication/api-keys" target="_blank">API keys</a>. I don&#8217;t need any personalized info, so that&#8217;s the simple way to go. To obtain the keys, go to main menu (hamburger icon) &gt; APIs and services &gt; Credentials and pick Create Credentials, and choose to create API keys. Copy the string and use it in the key=API_KEY parameters in POST requests. It&#8217;s possible to restrict the usage of this key in various ways (HTTP referrer, IP address etc.) but it wasn&#8217;t relevant in my case, because the script runs only on my computer.</p>
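<p>Stripped of the Perl, such a request boils down to something like this (a sketch; YOUR_API_KEY is a placeholder, and every character in q costs money):</p>

```
curl -s 'https://translation.googleapis.com/language/translate/v2' \
  -d source=en -d target=ja -d format=html \
  -d key=YOUR_API_KEY \
  --data-urlencode 'q=<p>Hello, world</p>'
```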
<p>The web interface for setting up cloud services is horribly slow, which is slightly ironic and a bit odd for a company like Google.</p>
<h3>The translation script</h3>
<p>I wrote a simple script for taking a piece of text in English and translating it into the language of choice:</p>
<pre><span class="hljs-comment">#!/usr/bin/perl</span>

<span class="hljs-keyword">use</span> warnings;
<span class="hljs-keyword">use</span> strict;
<span class="hljs-keyword">use</span> LWP::UserAgent;
<span class="hljs-keyword">use</span> JSON <span class="hljs-string">qw[ from_json ]</span>;

<span class="hljs-keyword">our</span> $WASTEMONEY = <span class="hljs-number">0</span>; <span class="hljs-comment"># Prompt before making request</span>
<span class="hljs-keyword">my</span> $MAXLEN = <span class="hljs-number">500000</span>;
<span class="hljs-keyword">my</span> $chars_per_dollar = <span class="hljs-number">50000</span>; <span class="hljs-comment"># $20 per million characters</span>

<span class="hljs-keyword">our</span> $APIkey = <span class="hljs-string">'your API key here'</span>;

<span class="hljs-keyword">my</span> ($outfile, $origfile, $lang) = @ARGV;

<span class="hljs-keyword">die</span>(<span class="hljs-string">"Usage: $0 outfile origfile langcode\n"</span>)
  <span class="hljs-keyword">unless</span> (<span class="hljs-keyword">defined</span> $origfile);

<span class="hljs-keyword">my</span> $input = readfile($origfile);

askuser() <span class="hljs-keyword">unless</span> ($WASTEMONEY);

<span class="hljs-keyword">my</span> $len = <span class="hljs-keyword">length</span> $input;

<span class="hljs-keyword">die</span>(<span class="hljs-string">"Cowardly refusing to translate $len characters\n"</span>)
  <span class="hljs-keyword">if</span> ($len &gt; $MAXLEN);

writefile($outfile, translate($input, $lang));

<span class="hljs-comment">################## SUBROUTINES ##################</span>

<span class="hljs-function"><span class="hljs-keyword">sub</span> <span class="hljs-title">writefile</span> </span>{
  <span class="hljs-keyword">my</span> ($fname, $data) = @_;

  <span class="hljs-keyword">open</span>(<span class="hljs-keyword">my</span> $out, <span class="hljs-string">"&gt;"</span>, $fname)
    <span class="hljs-keyword">or</span> <span class="hljs-keyword">die</span> <span class="hljs-string">"Can't open \"$fname\" for write: $!\n"</span>;
  <span class="hljs-keyword">binmode</span>($out, <span class="hljs-string">":utf8"</span>);
  <span class="hljs-keyword">print</span> $out $data;
  <span class="hljs-keyword">close</span> $out;
}

<span class="hljs-function"><span class="hljs-keyword">sub</span> <span class="hljs-title">readfile</span> </span>{
  <span class="hljs-keyword">my</span> ($fname) = @_;

  <span class="hljs-keyword">local</span> $/; <span class="hljs-comment"># Slurp mode</span>

  <span class="hljs-keyword">open</span>(<span class="hljs-keyword">my</span> $in, <span class="hljs-string">"&lt;"</span>, $fname)
    <span class="hljs-keyword">or</span> <span class="hljs-keyword">die</span> <span class="hljs-string">"Can't open $fname for read: $!\n"</span>;

  <span class="hljs-keyword">my</span> $input = &lt;$in&gt;;
  <span class="hljs-keyword">close</span> $in;

  <span class="hljs-keyword">return</span> $input;
}

<span class="hljs-function"><span class="hljs-keyword">sub</span> <span class="hljs-title">askuser</span> </span>{
  <span class="hljs-keyword">my</span> $len = <span class="hljs-keyword">length</span> $input;
  <span class="hljs-keyword">my</span> $cost = <span class="hljs-keyword">sprintf</span>(<span class="hljs-string">'$%.02f'</span>, $len / $chars_per_dollar);

  <span class="hljs-keyword">print</span> <span class="hljs-string">"\n\n*** Approval to access Google Translate ***\n"</span>;
  <span class="hljs-keyword">print</span> <span class="hljs-string">"$len bytes to $lang, $cost\n"</span>;
  <span class="hljs-keyword">print</span> <span class="hljs-string">"Source file: $origfile\n"</span>;
  <span class="hljs-keyword">print</span> <span class="hljs-string">"Proceed? [y/N] "</span>;

  <span class="hljs-keyword">my</span> $ans = &lt;STDIN&gt;;

  <span class="hljs-keyword">die</span>(<span class="hljs-string">"Aborted due to lack of consent to proceed\n"</span>)
    <span class="hljs-keyword">unless</span> ($ans =~ <span class="hljs-regexp">/^y/i</span>);
}

<span class="hljs-function"><span class="hljs-keyword">sub</span> <span class="hljs-title">translate</span> </span>{
  <span class="hljs-keyword">my</span> ($text, $lang) = @_;

  <span class="hljs-keyword">my</span> $ua = LWP::UserAgent-&gt;new;
  <span class="hljs-keyword">my</span> $url = <span class="hljs-string">'https://translation.googleapis.com/language/translate/v2'</span>;

  <span class="hljs-keyword">my</span> $res = $ua-&gt;post($url,
		      [
		       <span class="hljs-string">source =&gt;</span> <span class="hljs-string">'en'</span>,
		       <span class="hljs-string">target =&gt;</span> $lang,
		       <span class="hljs-string">format =&gt;</span> <span class="hljs-string">'html'</span>, <span class="hljs-comment"># Could be 'text'</span>
		       <span class="hljs-string">key =&gt;</span> $APIkey,
		       <span class="hljs-string">q =&gt;</span> $text,
		      ]);

  <span class="hljs-keyword">die</span>(<span class="hljs-string">"Failed to access server: "</span>. $res-&gt;status_line . <span class="hljs-string">"\n"</span>)
    <span class="hljs-keyword">unless</span> ($res-&gt;is_success);

  <span class="hljs-keyword">my</span> $data = $res-&gt;content;

  <span class="hljs-keyword">my</span> $json = from_json($data, { <span class="hljs-string">utf8 =&gt;</span> <span class="hljs-number">1</span> } );

  <span class="hljs-keyword">my</span> $translated;

  <span class="hljs-keyword">eval</span> {
    <span class="hljs-keyword">my</span> $d = $json-&gt;{data};
    <span class="hljs-keyword">die</span>(<span class="hljs-string">"Missing \"data\" entry\n"</span>) <span class="hljs-keyword">unless</span> (<span class="hljs-keyword">defined</span> $d);

    <span class="hljs-keyword">my</span> $tr = $d-&gt;{translations};
    <span class="hljs-keyword">die</span>(<span class="hljs-string">"Missing \"translations\" entry\n"</span>)
      <span class="hljs-keyword">unless</span> ((<span class="hljs-keyword">defined</span> $tr) &amp;&amp; (<span class="hljs-keyword">ref</span> $tr eq <span class="hljs-string">'ARRAY'</span>) &amp;&amp;
	     (<span class="hljs-keyword">ref</span> $tr-&gt;[<span class="hljs-number">0</span>] eq <span class="hljs-string">'HASH'</span>));

    $translated = $tr-&gt;[<span class="hljs-number">0</span>]-&gt;{translatedText};

    <span class="hljs-keyword">die</span>(<span class="hljs-string">"No translated text\n"</span>)
      <span class="hljs-keyword">unless</span> (<span class="hljs-keyword">defined</span> $translated);
  };

  <span class="hljs-keyword">die</span>(<span class="hljs-string">"Malformed response from server: $@\n"</span>) <span class="hljs-keyword">if</span> ($@);

  $translated =~ <span class="hljs-regexp">s/(&lt;\/(?:p|h\d+)&gt;)[ \t\n\r]*/"$1\n"/g</span>e;

  <span class="hljs-keyword">return</span> $translated;
}</pre>
<p>The substitution at the end of the translate() function adds a newline after each closing tag of a paragraph or header (e.g. &lt;/p&gt;, &lt;/h1&gt; etc.), so that the HTML is easier to read in a text editor. Otherwise it&#8217;s all one single line.</p>
<h3>Protecting your money</h3>
<p>By obtaining an API key, you effectively give your computer permission to spend money. That&#8217;s fine as long as everything works as intended, but a plain bug in a script that causes an infinite loop or recursion, or just feeding the system a huge file by mistake, can have consequences well beyond the CPU fan spinning a bit faster.</p>
<p>So there are two protection mechanisms in the script itself:</p>
<ul>
<li>The script prompts for permission, stating how much it will cost (based upon <a rel="noopener" href="https://cloud.google.com/translate/pricing" target="_blank">$20 / million chars</a>).</li>
<li>It limits a single translation to 500k chars (to prevent a huge file from being processed accidentally).</li>
</ul>
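<p>As an additional sanity check before running the script at all, the expected cost can be estimated from the shell, using the same $20-per-million-characters rate (the file created below is just a stand-in for a real input file):</p>

```shell
# Create a small sample input, just for demonstration.
printf 'hello world' > input.html

# Character count of the file; Google bills per character.
chars=$(wc -m < input.html)

# $20 per million characters = 2000 cents per million characters.
cost_cents=$((chars * 2000 / 1000000))

echo "$chars chars, roughly $cost_cents cents"
```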
<p>Another safety mechanism is to set up budgets and budget alerts. Go to Main menu (hamburger) &gt; Billing &gt; Budgets &amp; Alerts. Be sure to check &#8220;Email alerts to billing admins and users&#8221;. If I got it right, budgets don&#8217;t protect against spending; they only send notifications. So I selected a sum and enabled only the 100% threshold. It also seems sensible to check all the Discounts and Promotions options in the Credits part, which makes sure that the alert relates to the money actually spent, after deducting all promotion credits.</p>
<p>On top of that, it&#8217;s a good idea to set quota limits: Go to Main menu (hamburger) &gt; IAM &amp; Admin &gt; Quotas. Set the filter to Translation to get rid of a lot of lines.</p>
<p>It&#8217;s also the place to get an accurate figure for the current consumption.</p>
<p>Enable the quota for &#8220;v2 and v3 general model characters per day&#8221;, which is the only character limit that isn&#8217;t per minute, and set it to something sensible, for example 2 million characters if you&#8217;re a modest user like myself. That&#8217;s $40, which is fairly acceptable damage if the computer goes crazy, and high enough not to hit the roof normally.</p>
<p>Also do something about &#8220;v3 batch translation characters using general models per day&#8221;, and the same for the AutoML custom model quotas. I don&#8217;t use these, so I set both to zero, just to be safe.</p>
<p>There&#8217;s &#8220;Edit Quotas&#8221; at the top right, which didn&#8217;t work for me, probably because I did this during the trial period, when quotas are meaningless and apparently disabled anyhow (or more precisely, fixed at preset limits).</p>
<p>So the way to do it was somewhat tricky (and probably pointless): to enable a quota, right-click the &#8220;Cloud Translation API&#8221; link to the left of the quota item, and open it in a new tab. Set up the quota figure there. This description might not be accurate for real-life use, though: the system ignored my attempts to impose limits. They appeared on the page for editing them, but not on the main page.</p>
<h3>Supporting CJK in LaTeX</h3>
<p>I&#8217;m wrapping up this post with notes on how to feed LaTeX (pdflatex, more precisely) with Chinese, Japanese and Korean, with UTF-8 encoding, and get a hopefully reasonable result.</p>
<p>So first grab a few packages:</p>
<pre># apt install texlive-lang-european
# apt install texlive-lang-chinese
# apt install texlive-lang-korean
# apt install texlive-cjk-all</pre>
<p>Actually, texlive-lang-european isn&#8217;t related to CJK, but as its name implies, it&#8217;s useful for European languages.</p>
<p>I first attempted with</p>
<pre><span class="hljs-keyword">\usepackage</span>[UTF8]{ctex}</pre>
<p>but pdflatex failed miserably with an error saying that the fontset &#8216;fandol&#8217; is unavailable in current mode, <a rel="noopener" href="https://tex.stackexchange.com/questions/545681/critical-package-ctex-errorctex-fontsetfandol-is-unavailable-in-current" target="_blank">whatever that means</a>. After trying a few options back and forth, I eventually went for the rather hacky solution of using CJKutf8. The problem is that CJK chars are allowed only within</p>
<pre><span class="hljs-keyword">\begin</span>{CJK}{UTF8}{gbsn}

<span class="yadayada">[ ... ]</span>

<span class="hljs-keyword">\end</span>{CJK}</pre>
<p>but I want it on the whole document, and I need the language setting to be made in a file that is included by the main LaTeX file (a different included file for each language). So I went for this simple hack:</p>
<pre><span class="hljs-keyword">\AtBeginDocument</span>{<span class="hljs-keyword">\begin</span>{CJK}{UTF8}{gbsn}}
<span class="hljs-keyword">\AtEndDocument</span>{<span class="hljs-keyword">\end</span>{CJK}}</pre>
<p>As for the font, <a rel="noopener" href="https://www.overleaf.com/learn/latex/Chinese" target="_blank">it appears like</a> the gbsn or gkai fonts should be used with Simplified Chinese, and bsmi or bkai with Traditional Chinese. Since I translated into Simplified Chinese, some characters simply vanished from the output document when I tried bsmi and bkai. The back-translation to English of a document made with bsmi was significantly worse, so these dropped characters had a clear impact on the intelligibility of the Chinese text.</p>
<p>I got this LaTeX warning saying</p>
<pre>LaTeX Font Warning: Some font shapes were not available, defaults substituted.</pre>
<p>no matter which of these fonts I chose, so it doesn&#8217;t mean much.</p>
<p>So the choice is between gbsn and gkai, but which one? To decide, I copy-pasted Chinese text from up-to-date Chinese websites and compared it with the output of LaTeX, based upon the TeX file shown below. It was quite clear that gbsn is closer to the fonts these sites use, even though I suspect it&#8217;s a bit of a Times New Roman: the fonts used on the web have fewer serifs than gbsn. So gbsn it is, even though a font with fewer serifs would have been nicer.</p>
<p>For Japanese, there&#8217;s &#8220;min&#8221;, &#8220;maru&#8221; and &#8220;goth&#8221; fonts. &#8220;Min&#8221; is a serif font, giving it a traditional look (calligraphy style) and judging from Japanese websites, it appears to be used primarily for logos and formal text (the welcoming words of a university&#8217;s president, for example).</p>
<p>&#8220;Maru&#8221; and &#8220;goth&#8221; are based upon simple lines, similar to plain text on Japanese websites. The latter is more or less a bold version of &#8220;maru&#8221;, but it&#8217;s the one that seems to be popular. So I went with &#8220;goth&#8221;, whose clean and simple appearance matches the vast majority of Japanese websites, even though its boldness can get a bit messy with densely drawn characters. &#8220;Maru&#8221; just looks a bit thin compared with what is commonly preferred.</p>
<p>Korean has two fonts in theory, &#8220;mj&#8221; and &#8220;gt&#8221;. &#8220;mj&#8221; is a serif font with an old-fashioned look, and &#8220;gt&#8221; is once again the plain, gothic version. At first I failed to use the &#8220;gt&#8221; font, even though it was clearly installed (there were a lot of files with &#8220;gt&#8221; in their names in the same directories as the &#8220;mj&#8221; files). Trying the &#8220;gt&#8221; font instead of &#8220;mj&#8221; failed with</p>
<pre>LaTeX Font Warning: Font shape `C70/gt/m/it' undefined
(Font)              using `C70/song/m/n' instead on input line 8.

! Undefined control sequence.
try@size@range ...extract@rangefontinfo font@info
                                                  &lt;-*&gt;@nil &lt;@nnil</pre>
<p>But as it turns out, it should be referred to as &#8220;nanumgt&#8221;, e.g.</p>
<pre>\begin{CJK}{UTF8}{<span class="punch">nanumgt</span>}
나는 멋진 글꼴을 원한다
\end{CJK}</pre>
<p>It&#8217;s worth mentioning XeLaTeX, which allows using an arbitrary TrueType font within LaTeX, so the font selection is less limited.</p>
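<p>As a rough sketch of that route (not something I used for this project), the xeCJK package sets a system font for CJK text. The font name below is just an example; it assumes the Noto Serif CJK SC font happens to be installed, and the file must be compiled with xelatex rather than pdflatex:</p>

```latex
% Minimal xeCJK sketch; compile with xelatex, not pdflatex.
% "Noto Serif CJK SC" is an example font name, assumed to be installed.
\documentclass{article}
\usepackage{xeCJK}
\setCJKmainfont{Noto Serif CJK SC}
\begin{document}
它说什么并不重要，重要的是它是如何写的。
\end{document}
```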
<p>See <a rel="noopener" href="https://tex.my/2010/06/21/cjk-support-in-latex/" target="_blank">this page</a> on fonts in Japanese and Korean.</p>
<p>For these tests, I used the following LaTeX file, compiled with e.g.</p>
<pre>$ pdflatex test.tex</pre>
<pre><span class="hljs-keyword">\documentclass</span>{hitec}
<span class="hljs-keyword">\usepackage</span>[utf8]{inputenc}
<span class="hljs-keyword">\usepackage</span>[T1]{fontenc}
<span class="hljs-keyword">\usepackage</span>{CJKutf8}
<span class="hljs-keyword">\newcommand</span>{<span class="hljs-keyword">\thetext</span>}
{

它说什么并不重要，重要的是它是如何写的。
}

<span class="hljs-keyword">\AtBeginDocument</span>{}
<span class="hljs-keyword">\AtEndDocument</span>{}
<span class="hljs-keyword">\title</span>{This document}
<span class="hljs-keyword">\begin</span>{document}

gbsn:

<span class="hljs-keyword">\begin</span>{CJK}{UTF8}{gbsn}
<span class="hljs-keyword">\thetext</span>
<span class="hljs-keyword">\end</span>{CJK}

gkai:

<span class="hljs-keyword">\begin</span>{CJK}{UTF8}{gkai}
<span class="hljs-keyword">\thetext</span>
<span class="hljs-keyword">\end</span>{CJK}

bsmi:

<span class="hljs-keyword">\begin</span>{CJK}{UTF8}{bsmi}
<span class="hljs-keyword">\thetext</span>
<span class="hljs-keyword">\end</span>{CJK}

bkai:

<span class="hljs-keyword">\begin</span>{CJK}{UTF8}{bkai}
<span class="hljs-keyword">\thetext</span>
<span class="hljs-keyword">\end</span>{CJK}

<span class="hljs-keyword">\end</span>{document}</pre>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2022/08/google-translate-pdflatex-technical/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
