Parser Upgrade

General — Shahee on October 7, 2006 at 5:46 pm

We have upgraded our blog parser. For the past few months, the PHP XML parser library we used had been having problems parsing feeds from some blogs (especially blogs using the beta RSS feed from Blogger). We tried a workaround, but it also failed to parse all the blogs all the time. This made some users think that we were arbitrarily censoring certain posts.

Most of the time, posts failed to parse because the posts themselves, as written by users, were invalid. The PHP parser library is also not as forgiving as it should be toward non-technical users.

Either way, the new blog parser uses a Python library which is more brutal. So far it has parsed all the posts, and it regenerates the recently updated page every 2 hours. Earlier we updated posts every hour, but since 1 hour is too little exposure for new posts, we decided to lengthen the time between updates.

We have only upgraded the parser; the rest (including page generation) is still done by the same PHP tools we had before. Another small issue is that we assume the dates in the blogs' feeds are in a standard format (RFC 822, W3CDTF, ISO 8601), which we then convert to MVT.
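To illustrate the date assumption, here is a minimal sketch of that kind of conversion using only the Python standard library. It is not the code mvblogs runs. The function name to_mvt is hypothetical, MVT is assumed to be a fixed UTC+5 offset, and dates that carry no timezone are assumed to be UTC. Partial W3CDTF dates (a year or year-month alone) are not handled.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumption: MVT is a fixed UTC+5 offset with no daylight saving.
MVT = timezone(timedelta(hours=5), "MVT")


def to_mvt(date_text):
    """Parse an RFC 822 or W3CDTF/ISO 8601 date string and return it in MVT.

    Hypothetical helper for illustration only.
    """
    text = date_text.strip()
    try:
        # RFC 822 style, e.g. "Sat, 07 Oct 2006 12:46:00 GMT".
        parsed = parsedate_to_datetime(text)
    except (TypeError, ValueError):
        # W3CDTF / ISO 8601 style, e.g. "2006-10-07T12:46:00Z".
        # Older Python versions do not accept a trailing "Z", so spell it out.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: a date without a timezone is treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(MVT)
```

With this sketch, both to_mvt("Sat, 07 Oct 2006 12:46:00 GMT") and to_mvt("2006-10-07T12:46:00Z") come out as 17:46 on the same day at +05:00. A date in any other format raises ValueError, which is the case the assumption above does not cover.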

We are not sure how long the Python parser can be sustained. In any case, we are discussing a possible next version, which may put more emphasis on the purpose of mvblogs.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. | mvblogs.org