From The Archives
Pseudo Namespaces With PHP and Memcached
created May 27, 2009 at 12:02 am
updated March 9, 2010 at 2:30 am
UPDATED 09 MAR 2010 SEE THIS POST
I have been so incredibly busy lately that I have not had anytime to write here. Anyway I just wanted to take some time to talk about something cool I have been doing for a project I have been working on outside of work.
By now everyone knows what Memcached is so I won't bore you with an introduction. Basically it is an insanely simple caching system designed to eliminate load on the database.
While in an ideal world you would just throw cache on everything and watch your application scale to infinity and beyond, things aren't always that simple. The problem I'm going to show you today has to do with pagination and cached database queries. I will try to make this pretty straightforward so anyone can follow.
There is a user named Pablo Picasso who is using your site. He makes Cap-Sacs for a living and has received 15 messages from people on the site who are interested in purchasing one from him. This means that page 1 displays messages 6 to 15 (the 10 most recent messages) and page 2 displays 1 to 5.
Let's now imagine you cache these pages of results. A new user comes to your site and wants to buy a pink Cap-Sac. They send Picasso a message so your application says, "Oooh Picasso just got a new message. That's pretty rad. Since the query to get his messages is Cached let's just drop the cache for page 1." Now what happens you ask?
The new message shows up on the top of page 1 as expected so what's the problem?? Picasso now has 16 messages. Page 1 will display messages 7 to 16 which is fine, but since page 2 was cached from before it is going to show messages 1 to 5 still. This means that as long as the cache lives poor message number 6 (the last message on page 1 before the new message was recieved) gets dropped from the result set completely. Not to mention that page 1 thinks Picasso has 16 messages while page 2 still thinks he has 15.
This can also allow duplicates of the same item to show up if for example the user had 22 messages and page 1 and 2 were cached but page 3 wasn't. In this case a new message would cause page 3 and page 1 to be correct but since page 2 is still hitting cache, the last item on page 2 would be the same as the first item on page 3.
So what can we do about this? I see a few options
- Don't use memcached for paginated result sets
- Live with missing and/or duplicate items from time to time in your pagination
- Select everything at once and handle the pagination at the PHP level instead of in the database
- Use Pseudo cache namespaces to invalidate all cache keys for paginated queries
- Option 1
- This seems to be the approach that is often taken where I work, and of course it will work, but what happens if the data rarely changes? In that case it seems rather silly to continue to hit the database over and over for the same data.
- Option 2
- Surprisingly I don't think option 2 is a terrible option because it will definitely scale pretty well. Facebook search results have duplicate items all over most likely for this very reason, but since, in my example, I am thinking about displaying accurate data to the user I would stay away from this. For search results I think this is a much more viable option.
- Option 3
- This option is another interesting one. It would arguably scale better than any of the other options I have proposed although the problem here is what if Lady Gaga is using your site and she has 15,000 messages from fans. We don't want to have to select 15,000 items from the database if this were to miss cache.
- Option 4
- Well considering the title of this post I think you all knew this is the road I was going to go down. You get the performance benefits of caching, and you get 100 percent data accuracy.
The approach here is really quite simple and similar to the example found on the memcached FAQ page.
What I have chosen to do is use the DAO name as my main namespace identifier along with the unique userId. Please understand this is a very crude implementation just to illustrate my idea. It is more or less procedural code so I would not recommend doing it like this. I have been using a Cache_Namespace class to handle common logic such as generating the key and grabbing/setting the generation version/incrementing the version. Anyway here goes:
I tried to comment that code pretty well. As you can see the version is used as part of the query cache key so incrementing its value by 1 will make all calls to that method not hit cache. If you have more than one method you want to invalidate you can give them the same namespace key as seen in line 11.
Okay that is all for now.