Saving Bytes: Efficient Data Storage (MySQL) - Part 1

I Like It!

Hey there! Welcome to Making the Web - my personal blog about website development. Feel free to subscribe to my RSS feed to keep up with the latest. Alternatively, subscribe by email. Hope you enjoy this article!

When storing data in databases, it is important to ensure the data is stored using a minimal amount of space, while retaining its value and processing efficiency.

There are many common practices amongst MySQL developers who write applications which store data in a human-interpretable way; they often don't think about the computer's interpretation of the data, and often fail to realise that it can be more efficient and space-saving to store data in a certain form.

It is especially important to consider data storage techniques in high-demand applications, and where speed and storage-efficiency are key values. Web services require these key values: it is important that they can cope with extra demand, and can satisfy their user's desire for immediacy.

MySQL DataTypes
(Screenshot from phpMyAdmin)

In the first part of Saving Bytes: Efficient Data Storage, we will look at the storage of strings.

Storing Strings

With so many different data types available in database applications, developers have so much choice. But, it is because of this vast range of choice that a developer will often choose a type which is not suitable for the field's purpose. In MySQL, eight different string types are available: CHAR, VARCHAR, BINARY, VARBINARY, BLOB, TEXT, ENUM, and SET.

CHAR and VARCHAR

If you're storing small non-binary data, then it is probably best to use CHAR or VARCHAR. But which? Well, the answer is revealed when you understand the two types: a CHAR field has a fixed-length, defined when creating the table; a VARCHAR field has a variable-length.

Because VARCHAR is variable-length, we must use an extra byte to say how large the string is (or 2 bytes if there is more than 255 bytes of data). With CHAR, only the characters in the string are used, but, a string less in size than defined will be padded to ensure it fits.

When the information is the same length, time and time again, you should use CHAR; use for: fixed-length hashes, 2-character country/area codes, same length phone numbers, etc. You may have considered using a CHAR field to store a MD5 hash in hexadecimal form, but we will learn later that it is more space-efficient to use a BINARY field instead.

BINARY and VARBINARY

The BINARY and VARBINARY data types are similar to the CHAR and VARCHAR types. The major difference is that these types store binary data. Basically, these types work with no character set - they are stored as bytes rather than characters.

Like CHAR, BINARY fields are padded until the value is the correct, fixed length. With VARBINARY, a byte is reserved to define the field's length.

It is better to use VARBINARY when the data is not definitely fixed-length. This is because the padding doesn't change the value. When the data is fixed-length, though, BINARY should be used.

Many developers store MD5 hashes in form of a 32-character-long hexadecimal string in a CHAR field (or sometimes even VARCHAR). But, there is a better was of storing the data. Because each character can have one of 16 values, 240 values in each byte are never used. But, there is a way to utilise this space, and cut down the storage space by half:

Data is stored in binary form like this: 00110011. This is a total of eight bits, and therefore has 28 = 256 possible values. But, what would happen if we were to split that binary in to two parts? This: 0011 0011. Now, there is two 4-bit parts, each with 24 = 16 possible values. Taking advantage of this, we can store the first value in the first part, second in the second part, and then join them together to make one byte. In PHP 5, the option to output the raw data was added to the MD5 function.

BLOB and TEXT

BLOB is like a VARBINARY field in that it stores binary data. BLOB, though, can store much more data. TEXT fields are like BLOB fields but, they store characters rather than byte data.

There aren't many uses for the BLOB type. One possible use is to store uploaded files, rather than creating a physical file. This may be useful to certain applications, because information can be stored with the file.

The TEXT type is more common. It is sometimes used to store text for a page, as part of a content-management system. WordPress, for example, uses a LONGTEXT field to store articles.

There are a few variations of the TEXT and BLOB types which affect size; they are:

Type Maximum Length Storage
TINYBLOB, TINYTEXT 255 Length+1 bytes
BLOB, TEXT 65535 Length+2 bytes
MEDIUMBLOB, MEDIUMTEXT 16777215 Length+3 bytes
LONGBLOB, LONGTEXT 4294967295 Length+4 bytes

Most of the time, it is best to consider the maximum length of the fields rather than the storage requirements. There is not much of a point for most databases to make a decision between MEDIUM- and LONG- BLOB, because 1 extra byte amongst the other 16777215 is not going to make much of a difference.

ENUM and SET

The ENUM type allows a field to have numeric representations of string values. For example, if you declare ENUM to allow the values: ('value1', 'value2', 'value3'), then when storing value1, the database will use the numeric representation 1; value2, 2; value3, 3, etc.

It is sometimes very useful to use the ENUM type. For example, when a user fills out a form, and chooses from three options: "Good", "Bad", "OK", the data can be stored with numeric representation instead of in a VARCHAR field. Each value will only require 1 byte, instead of the previous LENGTH+1 bytes required for VARCHAR.

The SET type is very rarely used. Like, ENUM, it allows you to specify allowed values in a list. But, the fields can hold "listed" values themselves. For example, if the allowed values are ('value1', 'value2'), then the field accepts the following values:

"
'value1'
'value2'
'value1,value2'

To be honest, until writing this article, I had only ever heard of SET, with no idea what it actually done. I can only think of few examples where the type would be used. It could possibly be used in an advertising system, where the advertiser can select multiple target languages.

Scenarios

Below, I have listed some scenarios for storing string values. See if you can find the best data types and storage methods:

Scenario Best method (for space-efficiency)
Storing password in form of an MD5 hash Store in binary form in BINARY(16) field.
Storing a user's country Store 2-byte country code in CHAR(2) field
Storing a UK Postal Code (e.g. AA112BB) Store in CHAR(7) field.
OR As it is alphanumeric, base64_decode it and store in BINARY(5). Then, base64_encode when retrieving (and remove non-alphanumeric characters [=])
Storing a user's email address Store in TEXT field - email addresses can be 320-characters long.
Storing a username, with max length 30 Store in VARCHAR(30) field
Storing a user's continent ENUM('Asia', 'Africa', 'North America', 'South America', 'Antarctica', 'Europe', 'Australia')

There may be some better, space-saving ways; if there are, please let me know.

The End

Sorry, but you've reached the end of part 1. I hope you enjoyed this article, and I hope it has opened your eyes to the vast range of storage methods and mechanisms available, which could save some database space, and sometimes improve processing efficiency.

Don't worry! Another part will be coming soon. If you want, subscribe to the RSS feed, and you can be the first to get part 2!

3 Comments »

  1. Choosing Optimal MySQL Data Types | Making the web Said,

    April 7, 2008 @ 8:05 pm

    […] I have also written a more in-depth series of articles about how efficient data storage can be achived with some background knowledge: the first part focuses on the storage of strings. […]

  2. matt Said,

    April 21, 2008 @ 8:57 am

    Excellent article, I picked up a few things. The only thing that I didn't quite grasp was what ENUM was, exactly. But I am designing a database at the moment and I intend to use some of these useful techniques, so thanks.

  3. Brendon Said,

    April 21, 2008 @ 5:41 pm

    Thanks for your comment Matt,

    ENUM is one of the most over-looked data types in MySQL. It can be extemely useful in some situations, and I recommend you understand it fully: see http://dev.mysql.com/doc/refman/5.0/en/enum.html.

    -Brendon.

RSS feed for comments on this post · TrackBack URI

Text Link Ads

Leave a Comment

Making the Web | Chalvedon School and Sixth Form College | Messenger History | GCSE(WIKI): Simple bitesize revision for secondary school, KS4 students. | Encrypt files for free - high-grade encryption
Who's Populating The Web?