Predict.ly

Saturday, 7 November 2009

Ideas

Predict.ly is going to be delayed.

I've had a few new ideas and one of them is just too good.

All of the server structure and knowledge gained in the initial stages of predict.ly will be put to good use, and predict.ly will eventually become a reality. It's just that the other idea has won on priority.

Saturday, 31 October 2009

Servers

Varnish is installed, nginx is back.

Need to configure varnish but that can wait.

varnishstat and varnishtop - very nice.

The haproxy check is hitting varnish and getting a cache hit ratio of 0.9828. OK, that's easy when only 1 file is active, but it's working!

Thursday, 29 October 2009

Architecture

New plan, to save me reinventing the caching wheel.

haproxy-->POST DATA---------------------------> apache
haproxy-->gzip enabled client --> nginx -> varnish -> apache
haproxy-->no gzip----------------> varnish -> apache
haproxy-->static--------> nginx
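The routing above can be sketched as a small decision function. This is only an illustration in Python, not haproxy configuration: the function name `pick_backend`, the `/static/` path prefix, the order of the checks, and the chain labels are all assumptions made for the example.

```python
def pick_backend(method: str, path: str, accept_encoding: str = "") -> str:
    """Return the chain of servers a request would pass through.

    Hypothetical sketch: the "/static/" prefix and the returned labels are
    assumptions for illustration, not real haproxy settings.
    """
    # POST data skips every cache and goes straight to apache.
    if method.upper() == "POST":
        return "apache"
    # Static files are served by nginx directly, bypassing varnish.
    if path.startswith("/static/"):
        return "nginx"
    # Crude Accept-Encoding check: drop any ";q=..." part and compare names.
    encodings = [e.split(";")[0].strip().lower() for e in accept_encoding.split(",")]
    if "gzip" in encodings:
        # nginx sits in front purely to gzip what varnish returns.
        return "nginx -> varnish -> apache"
    return "varnish -> apache"
```

For example, `pick_backend("GET", "/p/1", "gzip, deflate")` takes the nginx -> varnish -> apache route, while the same request without gzip support goes to varnish first.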

Varnish is a cool caching proxy that understands how to stitch together parts of documents. It's basically SSI on steroids and will handle all caching of page content. The cool thing is I can just add another node with varnish on it, et voila, instant cache. No need to pre-generate etc.

The only downside is that it doesn't produce gzip content (in the way it'll be used in this instance). That's why nginx will be in front of it, purely to gzip its output.
Lighttpd is getting scrapped again for nginx because I won't be using SSI, and nginx will push static content directly, bypassing varnish. Why cache static stuff?

On the mysql front, I'm looking at DRBD as an initial alternative to replication. This is a bit like RAID 1 over a network with failover. It's different to master/master replication in that you can't use both masters concurrently.

A very similar scheme is used on our massive db at work. It is quite complicated to set up, so it might not happen for now - I will probably give it a go and see how I get on.

Tuesday, 27 October 2009

Sharding Keys 2

Slept on it and I'm happy with the keys.

I'm shying away from an earlier idea of precaching everything on the site and using SSI. The delay is just too big and I'm also concerned about disk IO.

I want votes and other things to be instant.

Questions

Do you allow non members to vote? (keep a tally of members vs visitors voting)

Is it acceptable to show the person voting/etc the update immediately, then refresh the cache for the rest of the population slightly later?
- if so, what delay?

If I monitored what was active I could change cache procedures dynamically, heavy write frequency reduces the advantage of caching.
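One way to picture this is a function that picks a cache lifetime from recent read and write counts. This is a rough sketch only: the name `choose_ttl`, the 300 second ceiling, and the rule that caching stops once half the traffic is writes are all assumptions, not decisions made for the site.

```python
def choose_ttl(reads: int, writes: int, max_ttl: int = 300) -> int:
    """Return a cache lifetime in seconds; 0 means do not cache.

    Hypothetical policy: the thresholds are illustrative assumptions.
    """
    total = reads + writes
    if total == 0:
        # Idle item: nothing is invalidating it, so cache for the full period.
        return max_ttl
    write_share = writes / total
    if write_share >= 0.5:
        # Writes dominate, so a cached copy would be stale almost immediately.
        return 0
    # Scale linearly: no writes gives max_ttl, 50% writes gives 0.
    return int(max_ttl * (1 - 2 * write_share))
```

With these assumed numbers, an item seeing 75 reads and 25 writes would be cached for 150 seconds, and one with as many writes as reads would not be cached at all.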

That sounds almost intelligent.

Monday, 26 October 2009

Sharding Keys

I've been going over the problem of keys for sharding all day.

I've come full circle and realised auto-increment works perfectly fine.

The trick is that, as everything is sharded on user, as long as the userID is present alongside the item's auto-increment, it will work without an issue.

Let's take Jonno's prediction 500

Jonno@prediction@500 - seems ok to me

Wherever jonno goes in terms of db shards, his 500th prediction will follow.

This essentially boils down to: find where jonno is, then take prediction 500 from there.

If on the other hand you give each prediction a unique ID, and then try to find it, you'd need to look it up or look in all shards.

This ID will also change anytime more shards are made, thus all urls would break.

Inside the shard itself, using Jonno as the key everywhere isn't so great. Instead it's translated to an int, which then works as per usual inside the shard. All cross-shard queries would revert to the char username.
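The lookup described in this post can be sketched roughly as follows. It assumes a small directory that maps each username to its current shard and to the int used inside that shard; `USER_DIRECTORY`, `parse_key`, `locate`, the shard names, and the id 17 are all made up for the example, and usernames are assumed not to contain "@".

```python
from typing import NamedTuple


class ItemKey(NamedTuple):
    user: str
    kind: str
    number: int


def parse_key(key: str) -> ItemKey:
    """Split a key such as 'Jonno@prediction@500' into its three parts."""
    user, kind, number = key.split("@")
    # Lower-casing the username is an assumption made for this sketch.
    return ItemKey(user.lower(), kind, int(number))


# Hypothetical directory: username -> (shard name, int user id inside the shard).
USER_DIRECTORY = {"jonno": ("shard_1", 17)}


def locate(key: str, directory=USER_DIRECTORY):
    """Find where the user lives, then point at the item inside that shard."""
    item = parse_key(key)
    shard, user_id = directory[item.user]
    return shard, user_id, item.kind, item.number
```

If jonno is later moved to another shard, only his directory entry changes. The key Jonno@prediction@500, and any url built from it, stays the same.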

Sunday, 25 October 2009

Sharding

Sharding is nearly every web 2.0 site's answer to database scalability.

The advice is to design to allow sharding.

If/when predict.ly becomes immensely popular, data should be sharded.

However, I'd really like to play with sharding now. Therefore I've devised (after quite some hours) a scheme to shard the data. Based on what I've read, it is fundamentally the web 2.0 scheme of sharding by user ID.

Everything on the site is created/owned by a user (even system users). All keys will be a combination of user/item id. Mysql autonumbers are not really viable anymore, so a basic sequence implementation is needed.

I've gone over the scheme 20 times and I'm fairly confident. The problem will be getting zend framework/doctrine to deal with the sharding at the application level. This looks very useful on dealing with it in doctrine/zend: http://blog.routydesign.com/?p=62

The other issue is - I don't have the hardware to run sharded databases. I only have 2 database servers. However the shards will be placed in differing database names on the same servers. No cross database queries are allowed (not sure if mysql even supports them to be honest).

This will, to all intents and purposes, create an application level that is using a sharded database. When the database actually needs to scale, I will migrate the shards out, initially to 2 additional databases. That will be a very natural migration, as the whole databases can be moved in a single action.

To reduce the impact of this admittedly too early optimisation, there will only be 2 shards. The stages that need to query multiple shards will be doubling up on work, but that's the price of sharding.

The biggest potential penalties are tags and comments.

Tags
Initially the tag table will be in a database of its own with no joins. Each prediction is limited to 5 tags - so a simple select with 5 primary keys gets the tags for a prediction. Getting tags associated with a user will be similar: user > predictions > select tags by primary key. If the tags database has to be sharded, then multiple queries will be required.

The problem becomes finding out what tags are the most common. Each user shard will need to be queried and then an aggregate produced. Luckily this is not an operation that is required frequently.
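The aggregate step could look something like this sketch. It assumes each shard has already answered with its own tag counts as a plain mapping. The function name `most_common_tags` and the sample tags are invented for the example.

```python
from collections import Counter


def most_common_tags(per_shard_counts, limit=10):
    """Merge per-shard tag counts and return the most used tags overall.

    per_shard_counts: one {tag: count} mapping per shard (hypothetical input
    shape; in practice each mapping would come from a query on that shard).
    """
    total = Counter()
    for counts in per_shard_counts:
        # Counter.update adds counts together rather than replacing them.
        total.update(counts)
    return total.most_common(limit)
```

Each shard does its own counting, so the application only has to add up one small result per shard.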

Comments
Loading the comments is simple enough because the comments for a prediction are held in the same shard, simple join. Each comment has an author, and we need to know some detail about the author: name/rank/avatar/etc.

This is a killer. It's fine when there are 2 shards. You have at worst 2 queries per prediction view for the comments. However, as user is the primary shard mechanism, let's say it went to 26 shards - that's 26 queries just for the comments.
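The worst case comes from needing one author lookup per shard that holds at least one of the comment authors. A minimal sketch of that grouping is below. The name `group_authors_by_shard` and the `shard_of` mapping are assumptions for illustration.

```python
from collections import defaultdict


def group_authors_by_shard(authors, shard_of):
    """Batch comment authors by the shard they live on.

    authors: usernames of the comment authors (repeats allowed).
    shard_of: hypothetical {username: shard name} mapping.
    Returns {shard name: set of usernames}, i.e. one query per key.
    """
    batches = defaultdict(set)
    for author in authors:
        batches[shard_of[author]].add(author)
    return dict(batches)
```

The number of keys in the result is the number of queries needed for that page, which can never be more than the number of shards.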

All advice talks about denormalising the data, but if that user's rank/avatar changes then you'd need to make the update potentially on all 26 shards. Now that I think about it, maybe that isn't too painful? How often would the image url/rank change? Seldom writes compared to heavy reads. Denormalisation might just be the key.
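As a rough picture of the denormalised approach, each comment row would carry a copy of the author's details, and a profile change would be pushed to every shard. The sketch below uses in-memory lists in place of real shards. The function name and the field names are hypothetical.

```python
def update_author_everywhere(shards, author, **changes):
    """Rewrite the copied author fields on every comment, in every shard.

    shards: list of shards, each a list of comment dicts holding copied
    author details (a stand-in for real databases in this sketch).
    Returns the number of comments touched.
    """
    touched = 0
    for comments in shards:
        for comment in comments:
            if comment["author"] == author:
                comment.update(changes)
                touched += 1
    return touched
```

Reading comments then needs no author lookup at all. The cost moves to the rare write, which has to visit every shard.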

I had toyed with the idea of recording comments/vibes asynchronously, but I'm now of the opinion it would damage usability too much. Instead the user posting the vibe/comment will see the change immediately. Other users will experience a delay as the cache refreshes, but they won't actually know it's delayed. I think the lesson is to show the change to the person making it, to confirm it has gone through happily. Everyone else can wait a little while.

3 Weeks old!

The idea for Predict.ly is 3 weeks old today!

Progress has been pretty good for a part time project I think. Servers are in a bit of a mess again, but nothing major - I turned off lighttpd temporarily until I can sort out the log rotations.

Concept is now pretty solid, db is looking good, code is started, pretty bits of the design are started - not bad.

DNS has been switched over to the new webhost and I've started migrating all my other domains too.

Unfortunately my main email has been knocked offline due to the old host being useless! Hope to have it back within 48 hours.