Recently completed a social media website that has several facebook/twitter aspects, one of the main concerns is something known as the super-user, a user that’s so popular that each action by this user will impact thousands of other connected users on the site. Designing for scalability and speed for this range of users require some special thinking.
I’ve came up with a theory that “Normal” MySQL tables would be severally taxed if we let the DB classes run each time a user refreshes his or her walls.
Databases are “usually” meant for storing data (usually) in 3rd normal form but social sites tend to have many lists such as friends, messages (listed into walls),a term i’ve regarded as “relationship” data. The problem with this data is that it’s repetitive and somewhat useless beyond the scope of the individual. Making sense not to store such data into table but into the person’s record (row) as longtext/blob, and retrieving when needed, this blob isn’t indexed in any way so it’s just sitting in the row minding it’s own business until needed.
PHP seems to have a solution for that http://www.php.net/manual/en/function.serialize.php unfortunately that solution isn’t viable in this circumstance.
Imagine this scenerio
1) Users have many (many) friends
2) When the user updates his wall, his friends wall get updated (friends in thousands?!) with the same message
3) Running a PHP script that pulls out ALL his friends’ wall and updating it then returning the new data back into their wall would probably result in hitting the memory max and stopping the script. Having something like this run on the server would introduce tons of load spikes.
4) PHP arrays are heavy in memory and SLOW, searching and moving data in-out of arrays are slow too.
5) Serialized PHP arrays are hard to read and even harder to manipulate in DB w/o screwing up the formatting.
So what do you do?
1) Maintain all “non-relationship” data in 3rd normal form in their respective tables. Use primary keys (auto increments).
2) The primary user table would have several “blob”s to store things like list of friends, personal wall in the following format
'1','2','3','4','525235'
3) Use CSV to maintain user’s friends and wall. CSV being just “string” would be easily manipulated in PHP and in SQL, you can append directly to the front of several thousand friends’ wall with just a single SQL statement
$friend_list = SELECT friends FROM user_table; //Pseudo text, returns friend's list in '1','2','555','666' format which is directly pumped into the next SQL statement
UPDATE user_table SET wall = REPLACE(TRIM(BOTH ',' FROM CONCAT_WS(',','\'".$message_id."\'', wall)), ',,', ',') WHERE uid IN (".$friend_list.")");
4) In PHP to make things even simplier use this CSVList class (shameless created by me) to manipulate the CSV string
Download it here: http://www.phpclasses.org/package/6218-PHP-Manipulate-a-list-of-entries-in-a-string-of-separa.html
The above class allow quick search for a variable in the CSV, mimics SQL LIKE (% X %), exports CSV and in array, exports a certain range in the CSV (aka LIMIT 20,30 – useful for pagination), counts no of variables in the CSV and some other stuff, so it’s worth checking it out. By itself, it’s possible to handle very high range of variables, 20-30 times higher then if the variables were stored in a simple (standard/non-assoc) php array. It’s alot faster then looping through an array to check for existence or matches of a variable.