<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>my tech blog &#187; MySQL</title>
	<atom:link href="http://billauer.se/blog/category/mysql/feed/" rel="self" type="application/rss+xml" />
	<link>https://billauer.se/blog</link>
	<description>Anything I found worthy to write down.</description>
	<lastBuildDate>Thu, 12 Mar 2026 11:36:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.2</generator>
		<item>
		<title>Perl + DBI: Measuring the time of database queries</title>
		<link>https://billauer.se/blog/2021/01/perl-mysql-query-time-elapsed/</link>
		<comments>https://billauer.se/blog/2021/01/perl-mysql-query-time-elapsed/#comments</comments>
		<pubDate>Wed, 06 Jan 2021 11:00:00 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6210</guid>
		<description><![CDATA[It&#8217;s often desired to know how much wall clock time the SQL queries take. As with Perl, there&#8217;s more than one way to do it. This is a simple way, which involves overriding DBI&#8217;s execute() method, so it measures the time and issues a warn() with basic caller info and the time in milliseconds. The [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s often desired to know how much wall clock time the SQL queries take. As with Perl, there&#8217;s more than one way to do it. This is a simple way, which involves overriding DBI&#8217;s execute() method, so it measures the time and issues a warn() with basic caller info and the time in milliseconds.</p>
<p>The same thing can be done with any other method, of course. Go &#8220;man DBI&#8221; and look for &#8220;Subclassing the DBI&#8221; for the full story.</p>
<p>So the first thing is to set the RootClass attribute when the DBI object is created, so that MySubDBI is the root class. Something like</p>
<pre>$dbh = DBI-&gt;connect( "DBI:mysql:thedatabase:localhost", "thedatabase", "thepassword",
		     { RaiseError =&gt; 1, AutoCommit =&gt; 1, PrintError =&gt; 0,
		      <span style="color: #ff0000;"><strong> RootClass =&gt; 'MySubDBI',</strong></span>
		       Taint =&gt; 1});</pre>
<p>and then, somewhere, the class needs to be defined. This can be in a separate module .pm file, but also at the end of the same file as the code for the DBI-&gt;connect:</p>
<pre>package MySubDBI;
use strict;

use DBI;
use vars qw(@ISA);
@ISA = qw(DBI);

package MySubDBI::db;
use vars qw(@ISA);
@ISA = qw(DBI::db);

# Empty, yet necessary.

package MySubDBI::st;
use Time::HiRes qw(gettimeofday tv_interval);
use vars qw(@ISA);
@ISA = qw(DBI::st);

sub execute {
  my ($sth, @args) = @_;
  my $tic = [gettimeofday];
  my $res = $sth-&gt;SUPER::execute(@args);
  my $exectime = int(tv_interval($tic)*1000);

  my ($package0, $file0, $line0, $subroutine0) = caller(0);
  my ($package1, $file1, $line1, $subroutine1) = caller(1);

  warn("execute() call from $subroutine1 (line $line0) took $exectime ms\n");
  return $res;
}

1;</pre>
<p>The code can be smarter, of course. For example, issue a warning only if the query time exceeds a certain limit.</p>
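<p>The threshold idea can be sketched independently of DBI. The helper below is hypothetical (the name and the 50&nbsp;ms limit are arbitrary choices): it runs any code reference and warns only when the wall clock time exceeds the limit. The same condition can go inside the overridden execute() above.</p>

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Hypothetical helper: run a code reference, measure its wall clock
# time, and warn only if it took longer than $limit_ms milliseconds.
sub timed_call {
  my ($limit_ms, $code, @args) = @_;
  my $tic = [gettimeofday];
  my $res = $code->(@args);
  my $ms  = int(tv_interval($tic) * 1000);
  warn("call took $ms ms (limit is $limit_ms ms)\n") if $ms > $limit_ms;
  return $res;
}
```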
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2021/01/perl-mysql-query-time-elapsed/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MySQL: LIKE vs. REGEXP() performance testing</title>
		<link>https://billauer.se/blog/2020/12/mysql-index-pattern-matching-performance/</link>
		<comments>https://billauer.se/blog/2020/12/mysql-index-pattern-matching-performance/#comments</comments>
		<pubDate>Tue, 29 Dec 2020 09:38:11 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6207</guid>
		<description><![CDATA[Introduction At some point I needed to choose between using LIKE or REGEXP() for a not so simple string match. Without going into the details, the matching string contains a lot of wildcard segments, and while both would have done the job, I thought maybe REGEXP() would benefit from some extra information about the wildcard [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>At some point I needed to choose between using LIKE or REGEXP() for a not so simple string match. Without going into the details, the matching string contains a lot of wildcard segments, and while both would have done the job, I thought maybe REGEXP() would benefit from some extra information about the wildcard parts. It&#8217;s not that I cared that &#8216;%&#8217; would match characters it shouldn&#8217;t, but I wanted to save some backtracking by telling the matching engine not to match just anything. Save some CPU, I thought. Spoiler: It was a nice thought, but no.</p>
<p>So I ran a few performance tests on a sample table:</p>
<pre>CREATE TABLE product_info (
	cpe_name BLOB NOT NULL,
	title BLOB NOT NULL,
	PRIMARY KEY (cpe_name(511))
) engine = MyISAM;</pre>
<p>Note that cpe_name is the primary key, and is hence a unique index.</p>
<p>I should also mention that the match patterns I use for testing are practically useless for real CPE name matching, because they don&#8217;t handle escaped &#8220;:&#8221; characters properly. Just in case you know what a CPE is and you&#8217;re here for that. In short, these are just performance tests.</p>
<p>I did this on MySQL server version 5.1.47, source distribution. There are newer versions around, I know. Maybe they do better.</p>
<h3>The art of silly queries</h3>
<p>So there are a bit more than half a million entries:</p>
<pre>mysql&gt; SELECT COUNT(*) FROM product_info;
+----------+
| COUNT(*) |
+----------+
|   588094 |
+----------+
1 row in set (0.00 sec)</pre>
<p>Let&#8217;s ask it another way:</p>
<pre>mysql&gt; SELECT COUNT(*) FROM product_info WHERE cpe_name LIKE '%';
+----------+
| COUNT(*) |
+----------+
|   588094 |
+----------+
1 row in set (<strong>0.08 sec</strong>)</pre>
<p>Say what? LIKE &#8216;%&#8217; is always true for a non-NULL BLOB. MySQL didn&#8217;t optimize this simple thing, and actually checked every single entry?</p>
<pre>mysql&gt; EXPLAIN SELECT COUNT(*) FROM product_info;
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                        |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
|  1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
1 row in set (0.00 sec)

mysql&gt; EXPLAIN SELECT COUNT(*) FROM product_info WHERE cpe_name LIKE '%';
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table        | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | product_info | ALL  | NULL          | NULL | NULL    | NULL | 588094 | Using where |
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)</pre>
<p>Apparently it did. Other silly stuff:</p>
<pre>mysql&gt; SELECT COUNT(*) FROM product_info WHERE NOT cpe_name IS NULL;
+----------+
| COUNT(*) |
+----------+
|   588094 |
+----------+
1 row in set (<strong>0.08 sec</strong>)

mysql&gt; EXPLAIN SELECT COUNT(*) FROM product_info WHERE NOT cpe_name IS NULL;
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table        | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | product_info | ALL  | PRIMARY       | NULL | NULL    | NULL | 588094 | Using where |
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)</pre>
<p>Silly or what? The column is defined as &#8220;NOT NULL&#8221;. What is there to check? So maybe the idea is that if I make stupid queries, MySQL responds with stupid behavior. Well, not really:</p>
<pre>mysql&gt; SELECT COUNT(*) FROM product_info WHERE 1=1;
+----------+
| COUNT(*) |
+----------+
|   588094 |
+----------+
1 row in set (0.00 sec)

mysql&gt; EXPLAIN SELECT COUNT(*) FROM product_info WHERE 1=1;
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra                        |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
|  1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
1 row in set (0.00 sec)</pre>
<p>It&#8217;s more like MySQL makes optimizations only if they&#8217;re really obvious. MySQL != gcc.</p>
<h3>LIKE vs REGEXP</h3>
<p>So it turns out that LIKE is really fast, and it seems to take advantage of the index:</p>
<pre>mysql&gt; SELECT cpe_name FROM product_info WHERE cpe_name LIKE 'cpe:2.3:a:hummingbird:cyberdocs:-%';
+-------------------------------------------------+
| cpe_name                                        |
+-------------------------------------------------+
| cpe:2.3:a:hummingbird:cyberdocs:-:*:*:*:*:*:*:* |
+-------------------------------------------------+
1 row in set (0.00 sec)</pre>
<p>The same query, only with a regex:</p>
<pre>mysql&gt; SELECT cpe_name FROM product_info WHERE cpe_name REGEXP '^cpe:2.3:a:hummingbird:cyberdocs:-.*';
+-------------------------------------------------+
| cpe_name                                        |
+-------------------------------------------------+
| cpe:2.3:a:hummingbird:cyberdocs:-:*:*:*:*:*:*:* |
+-------------------------------------------------+
1 row in set <strong>(0.21 sec)</strong></pre>
<p>Recall that an index keeps the keys sorted. Because the initial part of the string is fixed, it&#8217;s possible to narrow down the number of rows to match with an index lookup, and then work only on those that match that initial part. This was taken advantage of with the LIKE match, but apparently not with REGEXP():</p>
<pre>mysql&gt; EXPLAIN SELECT cpe_name FROM product_info WHERE cpe_name LIKE 'cpe:2.3:a:hummingbird:cyberdocs:-%';
+----+-------------+--------------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table        | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | product_info | <strong>range</strong> | PRIMARY       | <span style="color: #ff0000;"><strong>PRIMARY</strong></span> | 513     | NULL |    1 | Using where |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)

mysql&gt; EXPLAIN SELECT cpe_name FROM product_info WHERE cpe_name REGEXP '^cpe:2.3:a:hummingbird:cyberdocs:-.*';
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table        | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | product_info | <strong>ALL</strong>  | NULL          | <span style="color: #ff0000;"><strong>NULL</strong></span> | NULL    | NULL | 588094 | Using where |
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)</pre>
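<p>To see why the fixed prefix matters, here&#8217;s a toy illustration in Perl of the principle behind the range scan (this is not MySQL&#8217;s actual code): in a sorted list, all strings sharing a prefix form one contiguous run, so a binary search finds where the run starts, and the scan stops as soon as the prefix no longer matches.</p>

```perl
use strict;
use warnings;

# Toy model of an index range scan: binary-search a sorted list for
# the first entry ge the prefix, then collect entries for as long as
# they still begin with that prefix.
sub prefix_slice {
  my ($sorted, $prefix) = @_;
  my ($lo, $hi) = (0, scalar @$sorted);
  while ($lo < $hi) {
    my $mid = int(($lo + $hi) / 2);
    if ($sorted->[$mid] lt $prefix) { $lo = $mid + 1 } else { $hi = $mid }
  }
  my @hits;
  while ($lo < @$sorted
         && substr($sorted->[$lo], 0, length $prefix) eq $prefix) {
    push @hits, $sorted->[$lo++];
  }
  return @hits;
}
```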
<p>The conclusion is quite clear: For heavy duty matching, don&#8217;t use REGEXP() if LIKE can do the job. In particular if a lot of rows can be ruled out by virtue of the first characters in the string.</p>
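<p>If a regex is nevertheless unavoidable, one possible trick is to AND it with a LIKE on the pattern&#8217;s fixed leading part, so that the index narrows down the rows first and the regex only refines the result. A hypothetical helper for extracting that fixed part from a LIKE pattern (assuming the default backslash escape character):</p>

```perl
use strict;
use warnings;

# Hypothetical helper: return the literal leading part of a LIKE
# pattern, i.e. everything before the first unescaped '%' or '_'.
# That part can then go into an extra "LIKE 'prefix%'" condition
# that the index can serve.
sub like_prefix {
  my ($pat) = @_;
  my $prefix = '';
  while ($pat =~ /\G(?:\\(.)|([^%_\\]))/gc) {
    $prefix .= defined $1 ? $1 : $2;
  }
  return $prefix;
}
```

<p>The combined condition would then be along the lines of <code>cpe_name LIKE 'prefix%' AND cpe_name REGEXP '...'</code>.</p>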
<h3>Making it harder for LIKE</h3>
<p>Let&#8217;s warm up a bit:</p>
<pre>mysql&gt; SELECT cpe_name FROM product_info WHERE cpe_name LIKE 'cpe:2.3:a:<span style="color: #ff0000;"><strong>humming%</strong></span>:cyberdocs:-%';
+-------------------------------------------------+
| cpe_name                                        |
+-------------------------------------------------+
| cpe:2.3:a:hummingbird:cyberdocs:-:*:*:*:*:*:*:* |
+-------------------------------------------------+
1 row in set (0.00 sec)</pre>
<p>It was quicker than measurable, mainly because there was little to do:</p>
<pre>mysql&gt; EXPLAIN SELECT cpe_name FROM product_info WHERE cpe_name LIKE 'cpe:2.3:a:<span style="color: #ff0000;"><strong>humming%</strong></span>:cyberdocs:-%';
+----+-------------+--------------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table        | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | product_info | <strong>range</strong> | PRIMARY       | <span style="color: #ff0000;"><strong>PRIMARY</strong></span> | 513     | NULL |   <span style="color: #ff0000;"><strong>80</strong></span> | Using where |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)</pre>
<p>As expected, the index was used for the initial part of the string, and that left MySQL with 80 rows to actually do the matching (in this specific case). So it was quick.</p>
<p>But what if the wildcard is at the beginning of the string?</p>
<pre>mysql&gt; SELECT cpe_name FROM product_info WHERE cpe_name LIKE '%cpe:2.3:a:hummingbird:cyberdocs:-%';
+-------------------------------------------------+
| cpe_name                                        |
+-------------------------------------------------+
| cpe:2.3:a:hummingbird:cyberdocs:-:*:*:*:*:*:*:* |
+-------------------------------------------------+
1 row in set <strong>(0.10 sec)</strong></pre>
<p>So that took considerable time, though still less than the REGEXP() case. The index didn&#8217;t help in this case, it seems:</p>
<pre>mysql&gt; EXPLAIN SELECT cpe_name FROM product_info WHERE cpe_name LIKE '%cpe:2.3:a:hummingbird:cyberdocs:-%';
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table        | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | product_info | ALL  | NULL          | <span style="color: #ff0000;"><strong>NULL</strong></span> | NULL    | NULL | 588094 | Using where |
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)</pre>
<p>So apparently, the pattern was applied to all rows. It was still faster than REGEXP() was on a very simple match, so the latter seems not to be optimized in MySQL.</p>
<p>And to wrap this up, an expression full of wildcards, similar to the one I thought REGEXP() might do better on:</p>
<pre>mysql&gt; SELECT COUNT(*) FROM product_info WHERE cpe_name LIKE 'cpe:2.3:a:%:%:%:-:%:%:%:%:%:%';
+----------+
| COUNT(*) |
+----------+
|    13205 |
+----------+
1 row in set (<strong>0.11 sec</strong>)

mysql&gt; EXPLAIN SELECT COUNT(*) FROM product_info WHERE cpe_name LIKE 'cpe:2.3:a:%:%:%:-:%:%:%:%:%:%';
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table        | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | product_info | ALL  | PRIMARY       | NULL | NULL    | NULL | 588094 | Using where |
+----+-------------+--------------+------+---------------+------+---------+------+--------+-------------+
1 row in set (0.00 sec)</pre>
<p>So what happened here is that the initial part of the string didn&#8217;t help much, and the match seems to have been done on all rows. It took more or less the same time as the much simpler match pattern above.</p>
<h3>Conclusion</h3>
<p>There seem to be two main conclusions from this little set of experiments. The first one isn&#8217;t surprising: the most important factor is how many rows are being accessed, not so much what is done with them. The second is that MySQL does some sensible optimizations when LIKE is used; in particular, it narrows down the number of rows with the index when possible. Something it won&#8217;t do with REGEXP(), even with a &#8220;^&#8221; at the beginning of the regex.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2020/12/mysql-index-pattern-matching-performance/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MySQL: Enforcing long unique BLOB / TEXT columns with UNIQUE INDEX</title>
		<link>https://billauer.se/blog/2020/12/mysql-unique-index-blob-text/</link>
		<comments>https://billauer.se/blog/2020/12/mysql-unique-index-blob-text/#comments</comments>
		<pubDate>Tue, 22 Dec 2020 04:30:27 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[MySQL]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=6201</guid>
		<description><![CDATA[Introduction When creating indexes to TEXT / BLOB columns (and their variants), it&#8217;s required to specify how many characters the index should cover. In MySQL&#8217;s docs it&#8217;s usually suggested to keep them short in for better performance. There&#8217;s also a limit on the number of characters, which varies from one database engine to another, going [...]]]></description>
			<content:encoded><![CDATA[<h3>Introduction</h3>
<p>When creating indexes on TEXT / BLOB columns (and their variants), it&#8217;s required to specify how many characters the index should cover. In MySQL&#8217;s docs it&#8217;s usually suggested to keep them short for better performance. There&#8217;s also a limit on the number of characters, which varies from one database engine to another, going from a few hundred to a few thousand characters.</p>
<p>However it&#8217;s not unusual to use UNIQUE INDEX for making the database enforce the uniqueness of a field in the table. ON DUPLICATE KEY, INSERT IGNORE, UPDATE IGNORE and REPLACE can then be used to gracefully keep things tidy.</p>
<p>But does UNIQUE INDEX mean that the entire field remains unique, or is only the part covered by the index checked? Spoiler: Only the part covered by the index. In other words, MySQL&#8217;s uniqueness enforcement may be <strong>too strict</strong>.</p>
<p>Almost needless to say, if the index isn&#8217;t UNIQUE, it&#8217;s just a performance issue: if the number of covered characters is small, the database will more often fetch and discard the data of a row that turns out not to match the search criteria. But it doesn&#8217;t change the logical behavior. With UNIQUE INDEX, the number of characters does.</p>
<p>A simple test follows.</p>
<h3>Trying the UNIQUE INDEX</h3>
<p>I created a table with MySQL 5.1.47, with an index covering only 5 chars. Deliberately very short.</p>
<pre>CREATE TABLE try (
	id INT UNSIGNED NOT NULL AUTO_INCREMENT,
	message BLOB NOT NULL,
	PRIMARY KEY (id),
	UNIQUE INDEX (message(5))
) engine = MyISAM;</pre>
<p>Which ends up with this:</p>
<pre>mysql&gt; DESCRIBE try;
+---------+------------------+------+-----+---------+----------------+
| Field   | Type             | Null | Key | Default | Extra          |
+---------+------------------+------+-----+---------+----------------+
| id      | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| message | blob             | NO   | UNI | NULL    |                |
+---------+------------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)</pre>
<p>Inserting the first row:</p>
<pre>mysql&gt; INSERT INTO try(message) VALUES('Hello, world');
Query OK, 1 row affected (0.00 sec)</pre>
<p>And now trying a second:</p>
<pre>mysql&gt; INSERT INTO try(message) VALUES('Hello there');
ERROR 1062 (23000): Duplicate entry 'Hello' for key 'message'</pre>
<p>That&#8217;s it. It just looked at the first five chars. Trying a string that differs within those five characters:</p>
<pre>mysql&gt; INSERT INTO try(message) VALUES('Hell there');
Query OK, 1 row affected (0.00 sec)</pre>
<p>No wonder, that worked.</p>
<pre>mysql&gt; SELECT * FROM try;
+----+--------------+
| id | message      |
+----+--------------+
|  1 | Hello, world |
|  2 | Hell there   |
+----+--------------+
2 rows in set (0.00 sec)</pre>
<h3>Handling large TEXT/BLOB</h3>
<p>And that leaves the question: How is it possible to ensure uniqueness on large chunks of text or binary data? One solution I can think of is to add a column for a hash (say SHA1), let the application calculate that hash for each row it inserts, and store it along with the data. Then make the UNIQUE INDEX on the hash, not the text. Something like</p>
<pre>CREATE TABLE try2 (
	id INT UNSIGNED NOT NULL AUTO_INCREMENT,
	message BLOB NOT NULL,
<span style="color: #ff0000;"><strong>	hash TINYBLOB NOT NULL,
</strong></span>	PRIMARY KEY (id),
<span style="color: #ff0000;"><strong>	UNIQUE INDEX (hash(40))
</strong></span>) engine = MyISAM;</pre>
<p>But wait. MySQL supports hashing functions. Why not use them instead? Well, the problem is that if I want an INSERT statement to push the data and its hash in one go, the query becomes a bit nasty. What came to my mind is:</p>
<pre>mysql&gt; INSERT INTO try2(message, hash) (SELECT t.x, SHA1(t.x) FROM (SELECT 'Hello there' AS x) AS t);
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0</pre>
<p>Can it get more concise than this? Suggestions are welcome. The double SELECT is required because I want the string literal to be mentioned once.</p>
<p>Isn&#8217;t it easier to let the application calculate the SHA1, and send it to the server by value? It&#8217;s a matter of taste, I guess.</p>
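<p>For completeness, calculating the hash in the application could look like this sketch (Digest::SHA&#8217;s sha1_hex() returns the same lowercase hex digest as MySQL&#8217;s SHA1(); the $dbh handle is assumed to be already connected):</p>

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1_hex);

my $message = 'Hello there';
# Same digest MySQL's SHA1() produced above:
my $hash = sha1_hex($message);  # 726c76553e1a3fdea29134f36e6af2ea05ec5cce

# Assuming $dbh is an already connected DBI handle:
# $dbh->do('INSERT INTO try2(message, hash) VALUES (?,?)',
#          undef, $message, $hash);
```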
<p>Anyhow, trying again with exactly the same:</p>
<pre>mysql&gt; INSERT INTO try2(message, hash) (SELECT t.x, SHA1(t.x) FROM (SELECT 'Hello there' AS x) AS t);
ERROR 1062 (23000): Duplicate entry '726c76553e1a3fdea29134f36e6af2ea05ec5cce' for key 'hash'</pre>
<p>and with something slightly different:</p>
<pre>mysql&gt; INSERT INTO try2(message, hash) (SELECT t.x, SHA1(t.x) FROM (SELECT 'Hello there!' AS x) AS t);
Query OK, 1 row affected (0.00 sec)
Records: 1  Duplicates: 0  Warnings: 0</pre>
<p>So yep, it works:</p>
<pre>mysql&gt; SELECT * FROM try2;
+----+--------------+------------------------------------------+
| id | message      | hash                                     |
+----+--------------+------------------------------------------+
|  1 | Hello there  | 726c76553e1a3fdea29134f36e6af2ea05ec5cce |
|  2 | Hello there! | 6b19cb3790b6da8f7c34b4d8895d78a56d078624 |
+----+--------------+------------------------------------------+
2 rows in set (0.00 sec)</pre>
<p>Once again, even though the example I showed demonstrates how to make MySQL calculate the hash, I would do it in the application.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2020/12/mysql-unique-index-blob-text/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Perl, DBI and MySQL wrongly reads zeros from database</title>
		<link>https://billauer.se/blog/2019/02/perl-mysql-zeroes-taint/</link>
		<comments>https://billauer.se/blog/2019/02/perl-mysql-zeroes-taint/#comments</comments>
		<pubDate>Fri, 15 Feb 2019 15:29:37 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[perl]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=5663</guid>
		<description><![CDATA[TL;DR: SELECT queries in Perl for numerical columns suddenly turned to zeros after a software upgrade. This is a really peculiar problem I had after my web hosting provider upgraded some database related software on the server: Numbers that were read with SELECT queries from the database were suddenly all zeros. Spoiler: It&#8217;s about running [...]]]></description>
			<content:encoded><![CDATA[<p>TL;DR: SELECT queries in Perl for numerical columns suddenly turned to zeros after a software upgrade.</p>
<p>This is a really peculiar problem I had after my web hosting provider upgraded some database related software on the server: Numbers that were read with SELECT queries from the database were suddenly all zeros.</p>
<p>Spoiler: It&#8217;s about running Perl in Taint Mode.</p>
<p>The setting was DBD::mysql version 4.050, DBI version 1.642, Perl v5.10.1, and MySQL Community Server version 5.7.25 on a Linux machine.</p>
<p>For example, the following script is supposed to write the number of lines in the &#8220;session&#8221; table:</p>
<pre>#!/usr/bin/perl -T -w
use warnings;
use strict;
require DBI;

my $dbh = DBI-&gt;connect( "DBI:mysql:mydb:localhost", "mydb", "password",
		     { RaiseError =&gt; 1, AutoCommit =&gt; 1, PrintError =&gt; 0,
		       Taint =&gt; 1});

my $sth = $dbh-&gt;prepare("SELECT COUNT(*) FROM session");

$sth-&gt;execute();

my @l = $sth-&gt;fetchrow_array;
my $s = $l[0];
print "$s\n";

$sth-&gt;finish();
$dbh-&gt;disconnect;</pre>
<p>But instead, it prints zero, even though there are rows in the said table. Turning off taint mode by removing the &#8220;-T&#8221; flag in the shebang line gives the correct output. Needless to say, accessing the database with the &#8220;mysql&#8221; command-line utility client gave the correct output as well.</p>
<p>This is true for any numeric readout from this MySQL wrapper. It is particularly problematic when an integer is used as a user ID of a web site, and fetched with</p>
<pre>my $sth = db::prepare_cached("SELECT id FROM users WHERE username=? AND passwd=?");
$sth-&gt;execute($name, $password);
my ($uid) = $sth-&gt;fetchrow_array;
$sth-&gt;finish();</pre>
<p>If the credentials are wrong, $uid will be undef, as usual. But if <strong>any valid user</strong> gives correct credentials, it&#8217;s allocated user number 0. Which I was cautious enough not to allocate as the site&#8217;s supervisor, but that&#8217;s actually a common choice (what&#8217;s the UID of root on a Linux system?).</p>
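<p>A defensive habit that would have contained this bug is to treat anything but a strictly positive numeric id as a failed login, not just undef. A sketch (the sub name is mine):</p>

```perl
use strict;
use warnings;

# Hypothetical check: accept a login only for a defined, strictly
# positive integer user id, so a bogus zero is rejected as well.
sub valid_uid {
  my ($uid) = @_;
  return defined($uid) && $uid =~ /^\d+$/ && $uid > 0;
}
```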
<p>A softer workaround, instead of dropping the &#8220;-T&#8221; flag, is to set the TaintIn flag in the DBI-&gt;connect() call instead of Taint. Taint stands for both TaintIn and TaintOut, so this change effectively disables TaintOut only: data arriving from the database is no longer tainted, which also sidesteps the zero-value bug. All other tainting checks remain in place, in particular those on data supplied from the network. So not enforcing the sanitization of data from the database is a small sacrifice (in particular if the script already has mileage running with the enforcement on).</p>
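<p>Concretely, only the attribute hash of the connect call changes (a sketch; DSN and credentials as in the script above):</p>

```perl
use strict;
use warnings;

# Same attributes as in the script above, with TaintIn replacing
# Taint, so only data going into queries is taint-checked, and
# fetched values are left untainted.
my %attrs = ( RaiseError => 1, AutoCommit => 1, PrintError => 0,
              TaintIn    => 1 );

# my $dbh = DBI->connect("DBI:mysql:mydb:localhost", "mydb", "password",
#                        \%attrs);
```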
<p>And in the end I wonder if I&#8217;m the only one who uses Perl&#8217;s tainting mechanism. I mean, if there are still (2019) advisories on SQL injections (mostly PHP scripts), maybe people just don&#8217;t care much about things of this sort.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2019/02/perl-mysql-zeroes-taint/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Why MySQL&#8217;s (SQL) DATETIME can and should be avoided</title>
		<link>https://billauer.se/blog/2009/03/mysql-datetime-epoch-unix-time/</link>
		<comments>https://billauer.se/blog/2009/03/mysql-datetime-epoch-unix-time/#comments</comments>
		<pubDate>Sun, 15 Mar 2009 14:58:24 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[datetime]]></category>
		<category><![CDATA[epoch]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[time]]></category>
		<category><![CDATA[unix]]></category>
		<category><![CDATA[utc]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=34</guid>
		<description><![CDATA[I warmly recommend reading the comments at the bottom of this page, many of which go against my point. While I still stand behind every word I said, in particular for web applications (which I believe is the vast majority of MySQL use), the comments below make some valid points, and single out cases where [...]]]></description>
			<content:encoded><![CDATA[<div style="border: 2px solid #9b9; margin: 20px 5px 50px 5px; padding: 15px 5px 0 5px;"><i></p>
<p>I warmly recommend reading the comments at the bottom of this page, many of which go against my point. While I still stand behind every word I said, in particular for web applications (which I believe is the vast majority of MySQL use), the comments below make some valid points, and single out cases where DATETIME actually <b>is</b> the right thing.</p>
<p>
Needless to say, this is a discussion, and we&#8217;re all free to make our own mistakes.</p>
<p></i></div>
<h2>SQL DATETIME sucks</h2>
<p>MySQL, among other databases, has a column type called DATETIME. Its name seems to mislead people into thinking that it&#8217;s suitable for storing time of events. Or suitable for anything.</p>
<p>This is a general SQL thing, by the way, but I&#8217;ll demonstrate it on MySQL.</p>
<p>I often find this column type in other people&#8217;s database schemas, and I wonder if the designer gave it a thought before using it. It&#8217;s true that in the beginning it looks simple:</p>
<pre>mysql&gt; CREATE TABLE stupid_date ( thedate DATETIME, PRIMARY KEY (thedate) );
Query OK, 0 rows affected (0.04 sec)

mysql&gt; INSERT INTO stupid_date(thedate) VALUES ( NOW() );
Query OK, 1 row affected (0.03 sec)

mysql&gt; SELECT * FROM stupid_date;
+---------------------+
| thedate             |
+---------------------+
| 2009-03-15 14:01:43 |
+---------------------+
1 row in set (0.00 sec)</pre>
<p>That was way too cute, wasn&#8217;t it? We also have the NOW() function, which fits in exactly, and puts the current time! Yay! Yay! And if the timestamp looks old-fashioned to you, I suppose there is a reason for that.</p>
<p>But wait, there are two major problems. The first one is that the time is given in the host&#8217;s local time. That was fair enough before the internet was invented. But today a web server can be a continent away. DATETIME will show you the local time of the server, not yours. There are SQL functions to convert timezones, of course. Are you sure that you want to deal with them? What happens when you want to move your database to a server in another timezone? What about daylight saving time? Local time is one big YUCK.</p>
<p><em>(Update: As commented below, the real stupidity is to use NOW(), and not UTC_TIMESTAMP(). The latter gives the UTC time, as its name implies)</em></p>
<p>Problem number two: Most applications don&#8217;t care what the absolute time is. The current time is commonly used to calculate how much time has elapsed since a certain event, or to filter elements according to whether they occurred within the last X seconds. Is the user still logged in? Have 24 hours elapsed since the last warning email was sent? And so on.</p>
<p>&#8220;Solution&#8221;: The SQL language supplies a variety of COBOL-like functions to calculate whatever we can ask for. And also an opportunity to get things horribly wrong, because the SQL statement becomes way too complicated.</p>
<h2>Use POSIX time() instead</h2>
<p>Sorry, didn&#8217;t mean to scare you off. It&#8217;s really simple: Any modern operating system, even Windows, will readily supply you with the number of seconds since January 1, 1970, midnight, UTC (that is, more or less GMT). This is also called &#8220;seconds since the Epoch&#8221; or &#8220;UNIX time&#8221;.</p>
<p>No matter where the computer is, what timezone it uses or what programming language you&#8217;re using, this simple integer representation will show the same number at any given moment.</p>
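<p>Here&#8217;s a quick sanity check in Python, just to drive that point home: two unrelated ways of asking for the current time land on the same epoch number, and the machine&#8217;s timezone setting never enters into it:</p>
<pre>import time
from datetime import datetime, timezone

a = time.time()
b = datetime.now(timezone.utc).timestamp()

# Both count seconds since the Epoch; the local timezone
# is irrelevant to both of them:
assert int(abs(a - b)) == 0
print(int(a))</pre>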
<p>You can, in fact, obtain this number from MySQL directly:</p>
<pre>mysql&gt; SELECT UNIX_TIMESTAMP(thedate) FROM stupid_date;
+-------------------------+
| UNIX_TIMESTAMP(thedate) |
+-------------------------+
|              1237118503 |
+-------------------------+
1 row in set (0.00 sec)</pre>
<p>This means that 1237118503 seconds elapsed between the Epoch (which is a global time point) and 14:01:43, Israeli <span style="text-decoration: underline;">LOCAL time</span>, on the day I wrote this post. So now we have an integer to work with, which is handy for calculations, but things will still get messy if we try to move the database to another server.</p>
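<p>If you want to convince yourself, Python&#8217;s datetime can render that very number in two timezones (Israel was at UTC+2 on that date, before daylight saving kicked in):</p>
<pre>from datetime import datetime, timezone, timedelta

t = 1237118503  # the number UNIX_TIMESTAMP() returned above

utc = datetime.fromtimestamp(t, timezone.utc)
israel = datetime.fromtimestamp(t, timezone(timedelta(hours=2)))

print(utc)     # 2009-03-15 12:01:43+00:00
print(israel)  # 2009-03-15 14:01:43+02:00

# Two wall-clock renderings, one single moment in time:
assert utc == israel</pre>
<p>Same integer, two different local representations. That&#8217;s exactly why the integer is the thing worth storing.</p>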
<h2>Store the number instead</h2>
<p>If we are interested in working with integers, why not store the integer itself in the database? We could go:</p>
<pre>mysql&gt; CREATE TABLE smart_date ( thedate INTEGER UNSIGNED, PRIMARY KEY (thedate) );
Query OK, 0 rows affected (0.00 sec)

mysql&gt; INSERT INTO smart_date(thedate) VALUES (1237118503);
Query OK, 1 row affected (0.00 sec)

mysql&gt; SELECT * FROM smart_date;
+------------+
| thedate    |
+------------+
| 1237118503 |
+------------+
1 row in set (0.00 sec)</pre>
<p>That wasn&#8217;t very impressive, was it? The first question would be &#8220;OK, how do I get this magic number, now that I don&#8217;t have the NOW() function?&#8221;</p>
<p>The short and not-so-clever answer is that you could always use MySQL&#8217;s UNIX_TIMESTAMP( NOW() ) for this (in fact, UNIX_TIMESTAMP() with no argument does the same). The better answer is that no matter which scripting or programming language you&#8217;re using, this number is very easy to obtain. I&#8217;ll show examples below.</p>
<p>As for the magnitude of this number, yes, it&#8217;s pretty big. But it will fit a signed 32-bit integer until year 2038 (and an unsigned one, like the column above, until 2106). I presume that nobody will use 32-bit integers by then.</p>
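<p>And the elapsed-time questions from earlier turn into plain integer arithmetic. A minimal sketch in Python; the column value and the 24-hour threshold are made up for illustration:</p>
<pre>from time import time

DAY = 24 * 3600  # seconds in 24 hours

# Hypothetical value read back from the integer column; faked
# here as "two days ago" so the example is self-contained:
last_warning_sent = int(time()) - 2 * DAY

# "Have 24 hours elapsed since the last warning email was sent?"
send_again = (time() - last_warning_sent) > DAY
print(send_again)  # True</pre>
<p>No date functions, no timezone handling, just a subtraction and a comparison.</p>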
<p>And finally, one could argue that DATETIME is convenient when reading from the database directly. True. But for that specific issue we have the FROM_UNIXTIME() function:</p>
<pre>mysql&gt; SELECT FROM_UNIXTIME(thedate) FROM smart_date;
+------------------------+
| FROM_UNIXTIME(thedate) |
+------------------------+
| 2009-03-15 14:01:43    |
+------------------------+
1 row in set (0.00 sec)</pre>
<p>And again, this is given in the computer&#8217;s local time. Which is fine, because it&#8217;s intended to be read by humans. In particular, humans who can easily account for the time difference between themselves and the server.</p>
<h2>Obtaining Epoch time</h2>
<p>Just to prove that it&#8217;s easy to know what the &#8220;Epoch time&#8221; is in any language, here are a few examples. Wherever it&#8217;s really simple, I&#8217;m showing how to convert this format to human-readable format.</p>
<p>In Perl:</p>
<pre>print time();
print scalar localtime time(); # Local time for humans</pre>
<p>In PHP:</p>
<pre>&lt;?php
echo time();
echo date('r', time() ); // Local time for humans
?&gt;</pre>
<p>In Python:</p>
<pre>from time import time
print(time())  # Python 3; under Python 2 this was: print time()</pre>
<p>(note that the time is returned as a float number with higher precision)</p>
<p>In C:</p>
<pre>#include &lt;time.h&gt;
#include &lt;stdio.h&gt;

int main () {
  time_t now = time(NULL);

  printf("%lld seconds since the Epoch\n", (long long) now);
  return 0;
}</pre>
<p>In JavaScript:</p>
<pre>&lt;script type="text/javascript"&gt;
now = new Date();
alert( now.getTime() / 1000 );
&lt;/script&gt;</pre>
<p>In this case, the time is shown with a fractional resolution.</p>
<p>The JavaScript example is not really useful for a database application, because the time is measured at the computer showing the page. In a website application, this is just anybody&#8217;s computer clock, which may be wrong. But it&#8217;s yet another example of how this time representation is widely available.</p>
<h2>Conclusion</h2>
<p>Drop those DATETIME columns from your tables, and use a simple, robust and handy format to represent time. Don&#8217;t let the database play around with a sensitive issue like time, and don&#8217;t risk getting confused by different functions when calculating time differences. Just because the DATETIME column type exists, it doesn&#8217;t mean there is a reason to use it.</p>
<p>Enjoy the database for what it&#8217;s best at: storing and retrieving information.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2009/03/mysql-datetime-epoch-unix-time/feed/</wfw:commentRss>
		<slash:comments>93</slash:comments>
		</item>
		<item>
		<title>BLOB, TEXT, and case sensitivity: MySQL won&#8217;t treat them the same</title>
		<link>https://billauer.se/blog/2009/03/mysql-blob-text-case-sensitivity/</link>
		<comments>https://billauer.se/blog/2009/03/mysql-blob-text-case-sensitivity/#comments</comments>
		<pubDate>Mon, 02 Mar 2009 09:59:51 +0000</pubDate>
		<dc:creator>eli</dc:creator>
				<category><![CDATA[Linux]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[mysql blob text case sensitivity white spaces session select insert unique]]></category>

		<guid isPermaLink="false">https://billauer.se/blog/?p=13</guid>
		<description><![CDATA[When I first discovered that there is both BLOB and TEXT in databases, I was puzzled. They occupy the same amount of disk space, and the database doesn&#8217;t alter the data itself anyhow. Why two? Of course, I followed the stream, and went for TEXT and VARCHAR for everything, since I don&#8217;t store binary data [...]]]></description>
			<content:encoded><![CDATA[<p>When I first discovered that there is both BLOB and TEXT in databases, I was puzzled. They occupy the same amount of disk space, and the database doesn&#8217;t alter the data itself anyhow. Why two?</p>
<p>Of course, I followed the stream, and went for TEXT and VARCHAR for everything, since I don&#8217;t store binary data in the database. That may not be the optimal choice in all cases.</p>
<p>It turns out that MySQL goes a long way to &#8220;help&#8221; the user with string operations. In particular, when a table column is defined as TEXT, VARCHAR or one of their derivatives, the database compares strings as a human would understand them. And if that definition sounds ambiguous to you, you&#8217;re in good company: Different versions of MySQL compare text strings differently. For example, stripping leading and trailing whitespace from the strings before comparing them: Some versions will do this, others won&#8217;t.</p>
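<p>To make the difference concrete, here&#8217;s a small sketch in Python (Python, not SQL, and padded_equal() is a helper I made up) mimicking a comparison that ignores trailing spaces, next to an exact one:</p>
<pre># padded_equal() is a made-up helper, mimicking collations that
# ignore trailing spaces; Python itself always compares exactly.
def padded_equal(a, b):
    return a.rstrip(" ") == b.rstrip(" ")

assert padded_equal("Hello", "Hello   ")  # the "forgiving" comparison
assert "Hello" != "Hello   "              # the exact comparison</pre>
<p>Two strings that are equal under one rule and different under the other: exactly the kind of corner case that shifts under your feet between server versions.</p>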
<p>The bottom line is that if you change your MySQL server, tiny bugs may creep in. All these corner cases may behave differently. This is the classic case of some entry disappearing from a list of 53478, without anyone noticing.</p>
<p>Another issue to consider is character collation: With TEXT, VARCHAR and friends, characters like &#8216;Ö&#8217; and &#8216;OE&#8217; may be treated as equivalent (under the German collation, among others). Just one example.</p>
<p>The solution: Make MySQL treat your data as binary. A BLOB column type, for example. You lose all those extra &#8220;features&#8221;, but gain stability across database versions and flavors. That means that you have to handle all the case-insensitivity issues yourself, as well as cleaning up the strings properly. With good practices, that&#8217;s not an issue. It&#8217;s a matter of whether you want to take responsibility, or let the database iron out those small wrinkles for you.</p>
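<p>For example, a tiny Python sketch of doing that cleanup yourself before the string ever reaches the database (normalize() is a made-up helper, not anything MySQL provides):</p>
<pre>def normalize(s):
    # Do the cleanup ourselves, once, on the way in:
    # strip surrounding whitespace and fold case.
    return s.strip().casefold().encode("utf-8")

# After normalization, equality is plain and byte-exact:
assert normalize("  Hello ") == normalize("HELLO")
assert normalize("Hello") != normalize("Hell o")</pre>
<p>With the normalized bytes stored in a BLOB column, comparisons behave identically on every server, because the database never second-guesses them.</p>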
<p>And finally: What about uniqueness in tables? Here&#8217;s a short session, using MySQL 4.0.24:</p>
<pre>mysql&gt; CREATE TABLE try (mydata TEXT, UNIQUE INDEX (mydata(20)) );
Query OK, 0 rows affected (0.00 sec)

mysql&gt; INSERT INTO try(mydata) VALUES('Hello');
Query OK, 1 row affected (0.00 sec)

mysql&gt; SELECT * FROM try WHERE mydata='hELLO';
+--------+
| mydata |
+--------+
| Hello  |
+--------+
1 row in set (0.00 sec)</pre>
<p>This was pretty much expected: As a TEXT column, the comparison was case-insensitive. And of course, the capital &#8220;H&#8221; was saved in the table, even though that doesn&#8217;t matter in string comparisons.</p>
<p>But what happens if we want to add an entry, which violates the uniqueness, when considering the strings in a case-insensitive manner?</p>
<pre>mysql&gt; INSERT INTO try(mydata) VALUES('HELLO');
ERROR 1062: Duplicate entry 'HELLO' for key 1</pre>
<p>As expected, MySQL didn&#8217;t swallow this. &#8220;Hello&#8221; and &#8220;HELLO&#8221; are the same, so they can&#8217;t live together when &#8220;mydata&#8221; is restricted as UNIQUE.</p>
<p>So let&#8217;s drop the table, and try this again, this time with a BLOB column. Spoiler: Everything is case-sensitive now.</p>
<pre>mysql&gt; DROP TABLE try;
Query OK, 0 rows affected (0.00 sec)

mysql&gt; CREATE TABLE try (mydata BLOB, UNIQUE INDEX (mydata(20)) );
Query OK, 0 rows affected (0.01 sec)

mysql&gt; INSERT INTO try(mydata) VALUES('Hello');
Query OK, 1 row affected (0.00 sec)

mysql&gt; SELECT * FROM try WHERE mydata='hELLO';
Empty set (0.00 sec)</pre>
<p>(why should it find anything? &#8216;Hello&#8217; and &#8216;hELLO&#8217; are completely different!)</p>
<pre>mysql&gt; SELECT * FROM try;
+--------+
| mydata |
+--------+
| Hello  |
+--------+
1 row in set (0.00 sec)

mysql&gt; INSERT INTO try(mydata) VALUES('HELLO');
Query OK, 1 row affected (0.00 sec)

mysql&gt; SELECT * FROM try;
+--------+
| mydata |
+--------+
| Hello  |
| HELLO  |
+--------+
2 rows in set (0.01 sec)</pre>
<p>(No problems with the uniqueness: &#8216;Hello&#8217; and &#8216;HELLO&#8217; are not the same in a BLOB)</p>
<p>To summarize all this: Before choosing between TEXT or BLOB, ask yourself if you want the database to treat the string exactly as it is, or if you want some forgiveness regarding case, whitespaces and natural language issues.</p>
<p>For example, are you sure that you want the user name and password as text? In particular, would you like the password to be case-insensitive? Do you want HTTP links as text? The host name is indeed case-insensitive to the web, but the path and the CGI arguments (everything after the question mark, if present) <strong>are</strong> case-sensitive (YouTube video IDs, for example).</p>
<p>Usually, using a text column is OK. But it&#8217;s a choice one has to make.</p>
]]></content:encoded>
			<wfw:commentRss>https://billauer.se/blog/2009/03/mysql-blob-text-case-sensitivity/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
