MySQL Connection Management in PHP

(via jpipes.com)

This article is intended to highlight various basic topics concerning proper methods of handling connections to MySQL databases in PHP, guidelines for caching dynamic content, and a technique called "lazy loading". Hopefully by the end of the article you'll have learned how to combat a very widespread and potentially devastating scalability problem seen in an enormous number of PHP web applications...

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
cached article

Above link seems broken, so I retrieved this right off of the google cache. All credits go to Jay Pipes.

MySQL Connection Management in PHP - How (Not) To Do Things

I'll warn you right now, this is going to be a long article. More than likely, I'll put a version of this up on the MySQL developer zone and PHP zone. This article is intended to highlight various basic topics concerning proper methods of handling connections to MySQL databases in PHP, guidelines for caching dynamic content, and a technique called "lazy loading". Hopefully by the end of the article you'll have learned how to combat a very widespread and potentially devastating scalability problem seen in an enormous number of PHP web applications.
An introduction to the problem

Before I start the discussion on connecting to MySQL servers via PHP, it's worth pointing out that the relative cost of connecting to a MySQL database, versus connecting to a PostgreSQL or Oracle installation, is very, very low. However, the fact that connecting to a MySQL resource is inexpensive does not mean that connection resources can be abused. Anyone who has ever seen the dreaded "Too many connections" error which occurs when MySQL attempts to service more connections than the number of concurrent connections afforded by the max_connections configuration variable, knows what I am talking about.

Connecting to a MySQL resource, while inexpensive, requires PHP to issue a call to the mysql client API via mysql_connect(). Passed as variables to this function are the connection arguments for the database host, user, password, etc. Once the MySQL database server has received the connection information, it will either return a success value, or a failure. Upon failure, the mysql_error() function can be used to determine what went wrong during the connection attempt (typically user credential problems or the max connections issue).

So, where, you ask, is the problem? Well, the issue that I commonly see is that connections are made to MySQL resources when they do not need to be made. But, you say, almost all web applications serve dynamic content, so therefore doesn't dynamic content virtually require a connection to a database be made?
Well, not really in many, many cases

Let's take as an example a very, very popular PHP web application installed on hundreds of thousands of servers worldwide: the blogging application Wordpress. Now, before I go any further, I want to say that the reader should not think that I am attacking Wordpress in this article, or deliberately trying to point out shortcomings in their software. By contrast, I picked Wordpress to demonstrate that the problem described in this article is widespread among PHP web applications. At the very end of the article, I'll present a patch to the current Wordpress source code which fixes the issue identified in this article. I will post the patch to the wp-testers mailing list after I complete the article. Promise. :-)

Now, a blogging application is indeed a data-driven web application. Typical blog software involves the posting of articles, the management of comments, the display of such articles and comments, and, of course, pages which serve to provide the various RSS and Atom feeds for the blog. So, one might argue that blogging software is highly dynamic, and therefore would necessarily issue calls to connect to the database upon every visit to the blogsite.

This, however, is not particularly true. Even on extremely busy blogs, content doesn't change on a continual basis. This point is even more relevant when you consider that after a certain number of days after an article is posted, content becomes almost entirely static. Keep this point in mind as you follow through the next sections, which walk through the page invocation process which Wordpress executes upon every hit to a blog page (including feed pages).
Investigating Wordpress page invocation

On all PHP pages in the Wordpress main directory — which include index.php, wp-atom.php, wp-rss, etc. — each file begins with the following include:

require(./wp-blog-header.php);

The wp-blog-header.php page does a couple things but mostly serves to include the following:

require_once( dirname(__FILE__) . '/wp-config.php');

OK, so we're heading over to wp-config.php... and we find some good stuff, such as the defines for the database connection parameters, and:

define('ABSPATH', dirname(__FILE__).'/';);
require_once(ABSPATH.'wp-settings.php');

OK, so a quick :find wp-settings.php later, we open up the first meaty file of our page invocation. The first 72 code lines of wp-settings.php do some housekeeping stuff, like checking if the PHP version is adequate and if the MySQL extension is installed (tangent: is it really necessary to do this on every web page invocation?!). After that, we see the following include:

define('WPINC', 'wp-includes');
require_once (ABSPATH . WPINC . '/wp-db.php');

Into the heart of the beast

OK, so thus far there's been nothing spectacular or extraordinary about the code. Just a few includes to make file maintenance orderly, but nothing unusual. However, the wp-includes/wp-db.php file contains perhaps the most common PHP/MySQL gotcha seen in today's web applications.

The file starts out with some defines and then the class definition of the Wordpress database abstraction object, called wpdb. The wpdb class contains the very typical methods commonly seen in a DB abstraction layer: get_col(), get_row(), get_results(), etc, which allow a query string to be passed in and executed against a MySQL database. However, there is one major problem in the design, which manifests itself in the last line of the file:

$wpdb = new wpdb(DB_USER, DB_PASSWORD, DB_NAME, DB_HOST);

Well, what's so wrong with that, you say? Well, what does the new operator do? It creates an object of a class type specified and, during the creation of the object, calls the class constructor, which is the class method with a name identical to the class name — in this case, the method called wpdb, shown here:

// ==================================================================
// DB Constructor - connects to the server and selects a database

function wpdb($dbuser, $dbpassword, $dbname, $dbhost) {
$this->dbh = @mysql_connect($dbhost, $dbuser, $dbpassword);
if (!$this->dbh) {
$this->bail("... content snipped ... ");
}

$this->select($dbname);
}

Can anyone tell what is happening on every page invocation to a Wordpress blog? Yup. A connection is being made to the underlying MySQL database. On a heavily hit blog site, this code can easily lead to the dreaded too many connections error because a connection to the database is being made even for mostly static content. There are a couple ways to combat this problem: Lazy loading and Content Caching. These two techniques can be used together to eliminate a huge portion of the database queries and connections in typical web applications.
Lazy Loading

The code in the wpdb class isn't fundamentally wrong. It just needs some tweaking to ensure that a connection to the database is only made if a query is executed against the database. A technique called lazy loading essentially delays the connection to the database until the last minute, instead of upon creation of the database abstraction object.

The MySQL Forge database abstraction layer uses lazy loading in just this way. The name of the class is SqlConnection, and it has an empty constructor. Instead of connection logic embedded in the constructor, the object has a Connect() method, which looks like the following. The code has been modified only to remove the logic which automatically handles master/slave replication switching:

    /**
     * Attempt to connect the resource based on supplied parameters.
     *
     * @return  boolean
     * @access  public
     * 
     * @param   string  (optional) Host name (Server name)
     * @param   string  (optional) User Name
     * @param   string  (optional) User Password
     * @param   string  (optional) Database Name
     */ 
    function Connect() {
        if (func_num_args() == 4) {
           
            // A different database has been requested other than the
            // standard global config settings
            $host = func_get_arg(1);
            $user = func_get_arg(2);
            $pass = func_get_arg(3);
            $dbname = func_get_arg(4);
           
        }
        else {
            $host = SQL_HOST;
            $user = SQL_USER;
            $pass = SQL_PASS;
            $dbname = SQL_DB_NAME;
        }

        /**
         * Short circuit out when already
         * connected.  To reconnect, pass
         * args again
         */
        if (is_resource($this->_Cnn) && func_num_args() != 4) {return true;}
           
        if (! $this->_Cnn = mysql_connect($host, $user, $pass)) {
            trigger_error(get_class($this) .
                          "::Connect() Could not connect to server: " .
                          mysql_error(), E_USER_ERROR);
            return false;
        }
        else {
            if (! mysql_select_db($dbname, $this->_Cnn)) {
                trigger_error(get_class($this) .
                              "::Connect() Could not connect to specified database on server: " .
                              mysql_error(), E_USER_ERROR);
                return false;
            }
            else {
                return true;
            }     ;     
        }
    }

The _Cnn member variable is a reference to a MySQL database resource that is returned upon a successful call to mysql_connect(). Notice that the Connect() method has logic which ensures that if an existing connection has already been made during page execution, then the Connect() method simply returns true.

You may be surprised to find out that, just like Wordpress, the MySQL Forge software creates a database abstraction object upon each call to the MySQL Forge website. The following code is included in all page invocations:

/**
* Fine to establish a global connection
* here, since connect doesn't occur until
* SQL execution.
*/
require_once(DIR_LIB . 'class/SqlConnection.php');
$GLOBALS['Db'] =& new SqlConnection();

The difference is that a connection to the database is not made in the SqlConnection class constructor, so having the object instantiated doesn't consume database resources, unlike the wpdb class. So, if the constructor doesn't call Connect(), then when exactly will mysql_connect() be called? Here, we see the Execute() method of SqlConnection:

    /**
     * Executes the supplied SQL statement and returns
     * the result of the call.
     *
     * @return  bool  
     * @access  public
     * 
     * @param   string  SQL to execute
     */ 
    function Execute( $Sql ) {
       
        /* Auto-connect to database */
        if (! $this->_Cnn) {
            $this->Connect();
        }
       
        if (!$this->_Res = mysql_query($Sql, $this->_Cnn)) {
            trigger_error(get_class($this) .
                          "::Execute() Could not execute: " .
                          mysql_error() .
                          " (SQL: " . $Sql . ")", E_USER_ERROR);
            return false;
        }
        else {
            return true;
        }
    }

In the Execute() method, you can see that if the _Cnn member variable is not set (meaning, a previous connection has not been made to the database), then the SqlConnection connects, otherwise, it simply executes the supplied string against that connection via the mysql_query() function and stores the returned result resource in the _Res member variable for use by other methods in the class.

Other methods of SqlConnection simply wrap the Execute() call and provide result sets in various forms. What this means is that on page evocations to MySQL Forge, unless dynamic data is actually needed, no connections to the database are actually created. Which leads us nicely to the other technique for handling semi-dynamic content web requests: Content Caching.
Content Caching

Caching is perhaps the most fundamental concept discussed in the field of computer sciences when it comes to performance of both hardware and software. A cache, simply defined, is a storage area for data that has been parsed, retrieved, calculated, or otherwise generated in an expensive operation. The cache functions to alleviate the need for various resources to regenerate the cached data upon every request for the data.

Caches exist everywhere in both hardware and software. For instance, on a hardware level, modern CPUs usually have at least two levels of hardware caches (usually called the L1 and L2 caches). These CPU-connected fast-access caches exist so that the CPU does not need to call a kernel-level RAM memory page access call, which is a relatively expensive operation since the speed of access to a RAM page is much slower than the access speed to the locally connected Lx caches. When speaking about caches, it's important to recognize that everything is relative to something else. Accessing a hard disk is much more expensive than accessing a page of RAM, which is much more expensive that accessing a line of bytes stored in the L1 cache. Likewise, in application-specific caches (which we'll be talking about next), the relative cost of accessing cached data is lower than retrieving the same information from the MySQL database.

So, let's talk a bit about basic content caching for a PHP web application. Although these examples use PHP, the discussion of application caching applies to all languages. Every web scripting language provides similar functionality to implement caching.

Application content caching occurs when a standard call to the database is replaced with a call to an application content cache. In these examples, we'll implement a simple file-based cache; other solutions are, of course, available, including using memcached or a static content web server proxy to serve web content. Wordpress actually implements its own caching mechanism, called ObjectCache. You can take a look at the implementation in the wp-includes/cache.php. However, this implementation has a couple design limitations that make it unsuitable for a discussion on general caching. It uses a tight coupling with other Wordpress functions and objects, which makes the caching mechanism unfriendly for general re-use.
A Simple File Cache Engine

Before we get into the implementation of the CacheEngine class that MySQL Forge uses, let's first take a look at some code from the /lib/class/ProjectMemberFinder.php class that handles requests to retrieve information about the members involved in a project listed in the MySQL Forge project directory:

    /**
     * Return project members based on project ID value
     *
     * @return  array
     * @param   int     project ID
     */
    function &GetByProjectId($Project) {
   
        /**
         * ProjectMembers don't change that often,
         * so cache the output of these calls.
         */
        $cache_id = 'project_members-' . $Project;
       
        if ($cache = $GLOBALS['CEngine']->GetFromCache($cache_id, $Seconds=0, $IsObject=true)) {
            return $cache;
        }
       
        $sql = "SELECT
                    pm.project
                  , pm.member
                  , fu.display_name
                  , pmr.description as role
                  , pm.can_write
                  , pm.can_read
                  , pm.joined_on
                  , pm.last_source_login
                  , pm.last_source_commit
              FROM " . $GLOBALS['SqlTables']['ProjectMember'] . " pm
                  INNER JOIN " . $GLOBALS['SqlTables']['ForgeUser'] . " fu
                      ON pm.member = fu.user_id
                  INNER JOIN " . $GLOBALS['SqlTables']['ProjectMemberRole';] . " pmr
                       ON pm.role = pmr.project_member_role_id
              WHERE pm.project = " . (int) $Project;

        $results = $GLOBALS['Db']->GetRecords($sql);

        $GLOBALS['CEngine']->WriteToCache($cache_id, $results);

        return $results;
    }

OK, so the first thing you will notice is that there's a comment saying basically, "look, this information really doesn't change all that much. Let's go ahead and cache the results of the database query for later re-use". We first ask the global CacheEngine object ($GLOBALS['CEngine']) if we have a cached version of the supplied Project's project members list:

if ($cache = $GLOBALS['CEngine']->GetFromCache($cache_id, $Seconds=0, $IsObject=true)) {
    return $cache;
}       

The GetFromCache() method of the CacheEngine class returns the requested data, or FALSE. So, in the above code, we simply return the cached data if it is available in the cache. The $Seconds argument to the GetFromCache() method is simply the number of seconds that the cached data should be considered valid. Passing a zero as this argument means we always consider the data valid. The $IsObject argument tells the CacheEngine to return the cached data as an array or an object. We'll see how this is implemented in a little bit.

OK, so if the cached data does not exist in the cache, the ProjectMemberFinder::GetByProjectId() method continues on to request the data from the underlying database. The global Db abstraction layer object (described earlier) has its GetRecords() method called, with a SQL string passed as a parameter:

$results = $GLOBALS['Db']->GetRecords($sql);

It is the next line of code that facilitates the caching of this data in our content cache:

$GLOBALS['CEngine']->WriteToCache($cache_id, $results);

So, upon the first invocation of the GetByProjectId() method of ProjectMemberFinder, for each unique supplied Project ID value, we issue a request to the database and then cache the results for each subsequent call to the function. This saves us an enormous amount of database interaction, increasing the overall scalability of the system since the software can handle more concurrent requests, since the database connection will no longer be a bottleneck to the system.

There are a couple cases that we need to handle when processing cache requests, including how to invalidate data in the cache. We'll get to these cases in a minute. First, let's take a look at the CacheEngine class' two main methods: WriteToCache() and GetFromCache().
The GetFromCache() Method

As you saw above, the GetFromCache() method takes three arguments and returns either FALSE, or the cached data. Let's take a closer look at the CacheEngine::GetFromCache() method.

/**
* Retrieves a Cache file and returns either an object
* or a string
*
* @return mixed
* @param string Name of File in Cache
* @param int Number of Seconds until File is considered old
* @param bool Return an object from Cache?
* @access public
*/
function GetFromCache( $FileName , $Seconds = 0, $IsObject = false) {

$this->_BuildFileName($FileName);

$return = false;
if ($Seconds == 0) {
if (file_exists($this->_CacheFilePath)) {
$return = $this->_ReadFromCache(); /> }
else {
return false;
}
}
else {
        $refresh_time = time() - (int) $Seconds;
if (filemtime($this->_CacheFilePath) > $refresh_time) {
$return = $this->_ReadFromCache(); /> }
else {
                                /** Cached data not valid, remove it */
                                $this->RemoveFromCache($FileName);
return false;
}
}
if ($IsObject) {
    $return = unserialize($return);
}
return $return;
}

The GetFromCache() function should be fairly easy to follow. The meat of the function lies in either checking that the file exists (if there is no time limit on the cached entry) with the PHP file_exists() function or checking the modification time of the file using the filemtime() function otherwise. The $IsObject flag argument simply runs an unserialize() on the data coming back from the internal _ReadFromCache() method, which we will look at next:

/**
* Reads the local file from the cache directory
*
* @return mixed
* @access private
*/
function _ReadFromCache() {
    $mq_setting = get_magic_quotes_runtime(); />     set_magic_quotes_runtime(0);r />     if (!$return_data = @ file_get_contents($file)) {
    trigger_error(get_class() .
'::_ReadFromCache(): Unable to read file contents'
, E_USER_ERROR);
    }
    set_magic_quotes_runtime($mq_setting);
    return $return_data;
}

This function handles reading the cached data from a cache file. The magic quotes runtime (perhaps the most annoying PHP feature ever) is turned off before reading the file to prevent automatic escaping of certain characters in the data, and then turned back to its original setting immediately after.

So, the reading of a cache file is fairly simple. Let's take a look at the write mechanism of the CacheEngine class. This code is adapted from a technique which George Schlossnagle details in his excellent read, "Advanced PHP Programming" (Developer's Library, 2004) called file swapping. The technique facilitates lock-less writing of a cache file while allowing for simultaneous read requests of the cache file. Let's take a look:

/**
* Writes data to the cache
*
* @return mixed
* @param string File Name (may be encoded)
* @param mixed Data to write
* @access public
*/
function WriteToCache( $FileName, $Data ) {
    if (is_array($Data) || is_object($Data)) {
$Data = serialize($Data);
    }
    $this->_BuildFileName($FileName);
    /**
     * Use a file swap technique to avoid need
     * for file locks
     */
    if (!$file = fopen($this->_CacheFilePath . getmypid(), "wb")) {
trigger_error(get_class($this) .
'::WriteToCache(): Could not open file for writing.'
, E_USER_ERROR);
return false;
    }
    $len_data = strlen($Data);
            fwrite($file, $Data, $len_data);
            fclose($file);
    /** Handle file swap */
    rename($this->_CacheFilePath . getmypid(), $this->_CacheFilePath);    
    return true;
}

The code above opens a temporary file for writing. Notice that the actual cache file name is appended with the getmypid() function output, essentially making the filename a temporary, unique filename. Then, given the fopen() call was able to open the file for writing, the data is then written to the file, and the file closed. Finally, the rename() function is used to change the temporary filename to the actual cache filename. Because the rename() function simply acts on the file inode (a structure which stores information about the file, not the file contents itself, the rename() operation is a) very quick, and b) allows other processes to read from any existing cache file with the existing name without blocking the rename operation.
Cache invalidation

OK, so our CacheEngine class now has most of the functionality needed to effectively cache data from the database. However, we still need a method of removing old cache data files. Hence, the very simple RemoveFromCache() method:

/**
* Removes a cache file from the cache directory
*
* @return mixed
* @param string File Name to remove (will be encoded)
* @access public
*/
function RemoveFromCache( $file_name ) {
    $this->_BuildFileName($file_name);
    if (!file_exists($this->_CacheFilePath)) {
return true;
           }
    else {
        if (!unlink($this->_CacheFilePath)) {
    trigger_error(get_class($this) .
'Unable to remove from cache file'
, E_USER_ERROR);  
    return false;
}
else {
    clearstatcache();
    return true;
}
    }
}

All the above function does is remove the cached file if it exists. So, when would we call this function? Well, if you were paying attention earlier, you wold have already seen one occasion. In the GetFromCache() method, the RemoveFromCache() method is called when the modification time of the file exceeds the expiration time supplied to that function. Additionally, the RemoveFromCache() method would be called when we manually want to remove a cache file, for instance, if the list of project members changes, as this snippet from the /lib/class/ProjectMember.php class illustrates. The snippet comes from the ApproveMembershipRequest() method of that class:

...
            /* Remove from review queue */
            $sql = "UPDATE ... "
            if ($GLOBALS['Db']->Execute($sql)) {
                $GLOBALS['Db']->Commit();
                $GLOBALS['Cache']->RemoveFromCache('project_members-' . $Project);
                return true;
            }
...

As you can see, when an approval is made, the database is updated and the cache for the project members is removed, so that upon the next request for this project's members, the cache file will be regenerated, ensuring valid, up-to-date data.
Summary

So, this article has been an exploration into some simple steps that you can take in order to increase the scalability of your web applications by lazy loading and content caching techniques. Below are the full class files for the CacheEngine class and the SqlConnection class used in the MySQL Forge application. Feel free to use as you wish. Additionally, a patch to the Wordpress source code to enable lazy loading is included below.

* CacheEngine class
* SqlConnection database abstraction class
* Patch to Wordpress to support lazy loading

As always, comments and suggestions for this article are more than encouraged and appreciated.

Comment