Greenguy's Board


Old 2006-12-21, 12:14 AM   #1
Useless
Certified Nice Person
 
 
Join Date: Oct 2003
Location: Dirty Undies, NY
Posts: 11,268
From what I understand, the ever-decreasing speed of an aggregator as it ages is due to the fact that many aggregation scripts go through and scan what you have stored in your database from feed A and compare it to feed A's present state. That's a lot of work, especially when you have that many feeds being handled and a now monstrous database. A more efficient aggregator would look at the post dates and add only what's new from a feed, instead of checking to see if old posts have been updated/edited. If they didn't get it right the first time, fuck 'em.
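
To make the date-based approach concrete, here is a minimal Python sketch. It assumes each feed entry carries a trustworthy `published` timestamp and that the aggregator remembers the newest one it has stored; `new_entries` and the field names are hypothetical, not taken from any particular aggregator script.

```python
from datetime import datetime, timezone


def new_entries(entries, last_seen):
    """Return only the entries published after last_seen, oldest first."""
    fresh = [e for e in entries if e["published"] > last_seen]
    return sorted(fresh, key=lambda e: e["published"])


# Hypothetical snapshot of feed A; the field names are illustrative only.
feed_a = [
    {"title": "old post", "published": datetime(2006, 12, 19, tzinfo=timezone.utc)},
    {"title": "new post", "published": datetime(2006, 12, 21, tzinfo=timezone.utc)},
]

# Newest timestamp already stored in the database for feed A (assumed).
last_seen = datetime(2006, 12, 20, tzinfo=timezone.utc)

# Only entries newer than last_seen are added; stored posts are never rescanned.
added = new_entries(feed_a, last_seen)
```

The work per update then depends on the size of the feed, not on the size of the database.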
Old 2006-12-21, 12:16 AM   #2
NobleSavage
Lord help me, I'm just not that bright
 
Join Date: May 2006
Posts: 103
I'd like one coded in C that generates new static pages every 2-5 minutes.
Old 2006-12-21, 12:29 AM   #3
cash29
Rock stars ... is there anything they don't know?
 
Join Date: Dec 2006
Posts: 14
Quote:
Originally Posted by NobleSavage
I'd like one coded in C that generates new static pages every 2-5 minutes.
There is a Python-based script that does just what you're asking for (obviously, not in C):
http://www.planetplanet.org/

But if you know some C, you can always optimize the parts that need to be.
Old 2006-12-21, 03:41 AM   #4
NobleSavage
Lord help me, I'm just not that bright
 
Join Date: May 2006
Posts: 103
Quote:
Originally Posted by cash29
There is a Python-based script that does just what you're asking for (obviously, not in C):
http://www.planetplanet.org/

But if you know some C, you can always optimize the parts that need to be.

Hey Cash,

That looks like a nice script. I'll have to give it a shot. And Python is just cool. Thanks!
Old 2006-12-21, 12:39 AM   #5
cash29
Rock stars ... is there anything they don't know?
 
Join Date: Dec 2006
Posts: 14
Quote:
Originally Posted by Useless Warrior
From what I understand, the ever-decreasing speed of an aggregator as it ages is due to the fact that many aggregation scripts go through and scan what you have stored in your database from feed A and compare it to feed A's present state. That's a lot of work, especially when you have that many feeds being handled and a now monstrous database. A more efficient aggregator would look at the post dates and add only what's new from a feed, instead of checking to see if old posts have been updated/edited. If they didn't get it right the first time, fuck 'em.
No, you CANNOT look at the post dates to determine uniqueness of posts. All I would have to do to spam your aggregator would be to update the date of my post without changing its content.
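
One way to make a re-dated post harmless is to key uniqueness on the content instead of the date. The Python sketch below is only an illustration of that idea: the `fingerprint` and `is_new` helpers and the post fields are hypothetical, and hashing the normalized title and body is an assumption about what "the same post" means.

```python
import hashlib


def fingerprint(title, body):
    """Hash the text of a post, deliberately ignoring its date."""
    normalized = " ".join((title + "\n" + body).split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_new(post, seen):
    """Record the post's fingerprint and report whether it was unseen."""
    key = fingerprint(post["title"], post["body"])
    if key in seen:
        return False
    seen.add(key)
    return True


seen = set()
original = {"title": "Hello", "body": "Same text", "date": "2006-12-20"}
redated = {"title": "Hello", "body": "Same text", "date": "2006-12-21"}

first = is_new(original, seen)   # content not seen before
second = is_new(redated, seen)   # same content, only the date moved
```

The lookup is a set membership test per incoming item, so it does not require rescanning the stored posts.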

Most feed aggregators should do a conditional GET: send the server's Last-Modified value back in an If-Modified-Since header or, even better, its ETag in an If-None-Match header. The server then answers 304 Not Modified if the feed hasn't changed since the aggregator last fetched it, so there is nothing to download or compare. You can read a little more about these here:
http://diveintopython.org/http_web_s..._features.html
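
A rough Python sketch of a request that uses these headers, with only the standard library. The helper names `conditional_headers` and `fetch_if_changed` are made up for illustration, and the sketch assumes the aggregator saved the ETag and Last-Modified values from its previous fetch of the feed.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as error:
        # urllib reports 304 as an HTTPError; treat it as "nothing new".
        if error.code == 304:
            return None, etag, last_modified
        raise
```

When `fetch_if_changed` returns `None` for the body, the feed can be skipped for this run without parsing it or touching the database.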
Old 2006-12-21, 12:42 AM   #6
NobleSavage
Lord help me, I'm just not that bright
 
Join Date: May 2006
Posts: 103
Quote:
Originally Posted by cash29

Most feed aggregators should do a conditional GET: send the server's Last-Modified value back in an If-Modified-Since header or, even better, its ETag in an If-None-Match header. The server then answers 304 Not Modified if the feed hasn't changed since the aggregator last fetched it, so there is nothing to download or compare. You can read a little more about these here:
http://diveintopython.org/http_web_s..._features.html
Couldn't a smart spammer alter his server response codes?
Old 2006-12-21, 12:48 AM   #7
cash29
Rock stars ... is there anything they don't know?
 
Join Date: Dec 2006
Posts: 14
Quote:
Originally Posted by NobleSavage
Couldn't a smart spammer alter his server response codes?
If he can hack on Apache's source code, then yes. Although this seems somewhat unlikely: how many webmasters have access to their own Apache? How many know enough C to make Apache behave in that way?
Old 2006-12-21, 01:49 AM   #8
NobleSavage
Lord help me, I'm just not that bright
 
Join Date: May 2006
Posts: 103
Quote:
Originally Posted by cash29
If he can hack on Apache's source code, then yes. Although this seems somewhat unlikely: how many webmasters have access to their own Apache? How many know enough C to make Apache behave in that way?
Couldn't you just use:

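<?php
// Any PHP script can set this header itself; here it always claims "modified just now".
$date_string = gmdate('D, d M Y H:i:s') . ' GMT';
header('Last-Modified: ' . $date_string);
?>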

Old 2006-12-21, 09:43 AM   #9
cash29
Rock stars ... is there anything they don't know?
 
Join Date: Dec 2006
Posts: 14
Quote:
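Originally Posted by NobleSavage
Couldn't you just use:

<?php
// Any PHP script can set this header itself; here it always claims "modified just now".
$date_string = gmdate('D, d M Y H:i:s') . ' GMT';
header('Last-Modified: ' . $date_string);
?>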

PHP can do that? Oops, my bad. Just goes to show how much of it I really know.
Old 2006-12-21, 07:30 AM   #10
Useless
Certified Nice Person
 
 
Join Date: Oct 2003
Location: Dirty Undies, NY
Posts: 11,268
Quote:
Originally Posted by cash29
No, you CANNOT look at the post dates to determine uniqueness of posts. All I would have to do to spam your aggregator would be to update the date of my post without changing its content.
Then you drop their feed. Done! One can't expect to let a site run itself and end up with quality. And if you force an aggregator to look at every post in a heavy database, you must accept that the slowdown is going to occur.

Last edited by Useless; 2006-12-21 at 07:34 AM.
Old 2006-12-21, 09:57 AM   #11
cash29
Rock stars ... is there anything they don't know?
 
Join Date: Dec 2006
Posts: 14
Quote:
Originally Posted by Useless Warrior
Then you drop their feed. Done! One can't expect to let a site run itself and end up with quality. And if you force an aggregator to look at every post in a heavy database, you must accept that the slowdown is going to occur.
Good point. The best solution to spam isn't a technical one: just don't allow spammy blogs on your aggregator.
After this discussion, I'm thinking:
* It makes more and more sense to just keep 50 items or so per feed at any given time. I think this should be enough to keep visitors busy and should keep your script running not too badly.

* Writing your own aggregator is the way to go. Most of the ones I've used were too immature and didn't scale well. Gregarius and Lilina (PHP-based) would both start to choke at around 15,000 feed items (I read a lot of tech blogs).
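
The per-feed cap in the first point can be sketched in a few lines of Python. This is an in-memory illustration only: the `FeedStore` class and the example feed URL are hypothetical, and a real aggregator would apply the same trimming to its database rows.

```python
from collections import defaultdict, deque

MAX_ITEMS_PER_FEED = 50  # the "50 items or so" figure from the post


class FeedStore:
    """Keep only the most recent items for each feed."""

    def __init__(self, max_items=MAX_ITEMS_PER_FEED):
        self._items = defaultdict(lambda: deque(maxlen=max_items))

    def add(self, feed_url, item):
        # Appending to a full deque silently drops the oldest item.
        self._items[feed_url].append(item)

    def items(self, feed_url):
        return list(self._items[feed_url])


store = FeedStore()
for n in range(60):
    store.add("http://example.com/feed", f"post {n}")

# Only the 50 newest posts survive; the first 10 were evicted.
kept = store.items("http://example.com/feed")
```

With a fixed cap, the total stored grows with the number of feeds and stays independent of how long the aggregator has been running.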
All times are GMT -4.