Greenguy's Board


Old 2006-12-20, 11:10 PM   #1
T Pat
You can now put whatever you want in this space :)
 
Join Date: Aug 2003
Location: Paridise
Posts: 3,244
Let's Talk Aggregator

First of all, http://www.rssxxxfeed.com is about four months old now, and for those who are curious about how it's doing without reciprocal links, here ya go:
Month  Unique visitors  Number of visits  SE hits
Aug    54               105               36
Sep    1136             2639              394
Oct    2829             5527              1876
Nov    3350             6441              2139
Dec    4274             6763              3500

I’m aggregating 1013 blog feeds, have 53800 individual posts listed and it’s growing by 400+ posts a day.

The two main reasons I decided on not requiring a link back are:
1. I have no track record; if I were Tommy, Linkster, Greenie or Walrus I would have.
2. I don't want the search engines mistaking me for a link farm.

The good thing about doing it this way is I don't have to worry about pissing someone off because I deleted their blog (I've had to delete over 400 already, mostly Thumbloggers). The bad thing is that finding the blogs and listing them is a lot of boring work.

I have Links Organizer installed and will be trading links but I’m going to be real selective.

Here's the shits of the whole thing: the aggregator script I'm using is Newstopica and it's slowed down to a fucking crawl. I've had Sparky and Kaktus take a look and they both have said the coding leaves a lot to be desired. Support for Newstopica is a friggin joke.

I'm too pig-headed to just shit can the whole thing, so I got in line to have Kaktus code a custom aggregator for me.

Here's where I need your help: if you were to have an aggregator written, what features would you want?
__________________
How To Keep An Asshole In Suspense

I'll Tell You Later
Old 2006-12-20, 11:43 PM   #2
cash29
Rock stars ... is there anything they don't know?
 
Join Date: Dec 2006
Posts: 14
Support for tags and categories. I want to start an aggregator myself and I'm finding that most web aggregators don't seem to support them (at least by default). It's sad: I did a web search for PHP-based aggregators and the ones I've found are not that impressive.
Old 2006-12-21, 12:01 AM   #3
walrus
Oh no, I'm sweating like Roger Ebert
 
Join Date: May 2005
Location: Los Angeles
Posts: 1,773
Accept pings, and limit the number of pings you'll accept per day. Tags and a tag cloud would be nice, especially if they read my tags; or you could require a minimum of one tag per post that points to your domain and only take the tags that point to you. (I am personally of the opinion that blog roll links are pretty useless.) Limit the number of front page posts so that surfers use the tagging system rather than take the easy route and click what's in front of their face.
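A daily ping cap like the one suggested here could be sketched roughly as follows. This is only an illustration: the class name `PingLimiter`, the default of 5 pings, and the choice to count per blog rather than site-wide are all assumptions, not anything specified in the thread.

```python
from collections import defaultdict
from datetime import date


class PingLimiter:
    """Accept at most `max_per_day` pings per blog per calendar day (assumed policy)."""

    def __init__(self, max_per_day=5):
        self.max_per_day = max_per_day
        self._counts = defaultdict(int)  # (blog_url, day) -> pings accepted so far

    def accept(self, blog_url, today=None):
        """Return True and count the ping if the blog is still under its daily cap."""
        key = (blog_url, today or date.today())
        if self._counts[key] >= self.max_per_day:
            return False
        self._counts[key] += 1
        return True
```

A real deployment would keep the counters in the database and clear out old days; the in-memory dict is just the shortest way to show the idea.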
__________________
Naked Girlfriend Porn TGP
free partner account
Old 2006-12-21, 12:14 AM   #4
Useless
Certified Nice Person
 
Join Date: Oct 2003
Location: Dirty Undies, NY
Posts: 11,268
From what I understand, the ever-decreasing speed of an aggregator as it ages is due to the fact that many aggregation scripts scan what you have stored in your database from feed A and compare it to feed A's present state. That's a lot of work, especially when you have that many feeds being handled and a now monstrous database. A more efficient aggregator would look at the post dates and add only what's new from a feed, instead of checking to see if old posts have been updated/edited. If they didn't get it right the first time, fuck 'em.
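The "add only what's new" idea can be sketched as a per-feed high-water mark. This is a minimal sketch under assumptions: the `(published, title)` tuple shape and the function names are made up for illustration, and it trusts the dates the feed reports.

```python
from datetime import datetime


def new_entries(entries, last_seen):
    """Return entries published after `last_seen`, oldest first.

    `entries` is a list of (published: datetime, title: str) tuples, a
    hypothetical shape standing in for whatever the feed parser returns.
    """
    fresh = [e for e in entries if last_seen is None or e[0] > last_seen]
    return sorted(fresh, key=lambda e: e[0])


def update_feed(entries, last_seen):
    """Return (entries to insert, new high-water mark) for one feed."""
    fresh = new_entries(entries, last_seen)
    if fresh:
        last_seen = fresh[-1][0]
    return fresh, last_seen
```

Only one timestamp per feed has to be stored and compared, so the cost of an update does not grow with the size of the posts table.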
Old 2006-12-21, 12:16 AM   #5
NobleSavage
Lord help me, I'm just not that bright
 
Join Date: May 2006
Posts: 103
I'd like one coded in C that generates new static pages every 2-5 minutes.
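The static-page approach, shown here in Python rather than C, might look something like this. It is a rough sketch: `render_page`, `write_static`, and the bare-bones HTML are assumptions for illustration only.

```python
import html
import os
import tempfile


def render_page(posts):
    """Build a minimal HTML page from (title, url) pairs."""
    items = "\n".join(
        '<li><a href="{}">{}</a></li>'.format(html.escape(url), html.escape(title))
        for title, url in posts
    )
    return "<html><body><ul>\n{}\n</ul></body></html>\n".format(items)


def write_static(path, posts):
    """Write the page to a temp file, then swap it into place atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(render_page(posts))
    os.chmod(tmp, 0o644)   # mkstemp creates the file owner-only; let the web server read it
    os.replace(tmp, path)  # visitors never see a half-written page
```

Run from cron every few minutes, a script like this means surfers only ever hit a plain file, and the database is touched once per rebuild instead of once per page view.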
Old 2006-12-21, 12:29 AM   #6
cash29
Rock stars ... is there anything they don't know?
 
Join Date: Dec 2006
Posts: 14
Quote:
Originally Posted by NobleSavage View Post
I'd like one coded in C that generates new static pages every 2-5 minutes.
There is a Python-based script that does just what you're asking for (obviously, not in C):
http://www.planetplanet.org/

But if you know some C, you can always optimize the parts that need it.
Old 2006-12-21, 12:39 AM   #7
cash29
Rock stars ... is there anything they don't know?
 
Join Date: Dec 2006
Posts: 14
Quote:
Originally Posted by Useless Warrior View Post
From what I understand, the ever-decreasing speed of an aggregator as it ages is due to the fact that many aggregation scripts scan what you have stored in your database from feed A and compare it to feed A's present state. That's a lot of work, especially when you have that many feeds being handled and a now monstrous database. A more efficient aggregator would look at the post dates and add only what's new from a feed, instead of checking to see if old posts have been updated/edited. If they didn't get it right the first time, fuck 'em.
No, you CANNOT look at the post dates to determine uniqueness of posts. All I would have to do to spam your aggregator would be to update the date of my post without changing its content.
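One way to decide uniqueness without trusting the date, offered here as an assumption rather than something any aggregator named in this thread is known to do, is to fingerprint the post content. The function names and the whitespace/case normalization are illustrative choices.

```python
import hashlib


def fingerprint(title, body):
    """Hash the normalized title and body; the post date is deliberately left out."""
    normalized = " ".join((title + "\n" + body).lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_new(seen, title, body):
    """Record the post and return True only the first time this content appears."""
    digest = fingerprint(title, body)
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

With the digest stored in an indexed column, a re-dated copy of an old post is rejected by a single lookup instead of a scan over every stored post.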

Most feed aggregators should make conditional requests: send If-Modified-Since with the Last-Modified value the server gave you last time or, even better, If-None-Match with its ETag. The server's response tells you whether the feed has changed since the aggregator last checked it. You can read a little more about these here:
http://diveintopython.org/http_web_s..._features.html
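A conditional fetch along these lines can be sketched with Python's standard library. The helper names `conditional_headers` and `fetch_if_changed` and the 30-second timeout are assumptions; the header names and the 304 status are standard HTTP.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from the validators saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    request = urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:  # feed unchanged: nothing to download or parse
            return None, etag, last_modified
        raise
```

Note that these validators describe the feed document as a whole, so they save bandwidth and parsing on unchanged feeds but do not by themselves say which items inside a changed feed are new.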
Old 2006-12-21, 12:42 AM   #8
NobleSavage
Lord help me, I'm just not that bright
 
Join Date: May 2006
Posts: 103
Quote:
Originally Posted by cash29 View Post

Most feed aggregators should make conditional requests: send If-Modified-Since with the Last-Modified value the server gave you last time or, even better, If-None-Match with its ETag. The server's response tells you whether the feed has changed since the aggregator last checked it. You can read a little more about these here:
http://diveintopython.org/http_web_s..._features.html
Couldn't a smart spammer alter his server response headers?
Old 2006-12-21, 12:48 AM   #9
cash29
Rock stars ... is there anything they don't know?
 
Join Date: Dec 2006
Posts: 14
Quote:
Originally Posted by NobleSavage View Post
Couldn't a smart spammer alter his server response headers?
If he can hack on Apache's source code, then yes. Although, this seems somewhat unlikely: how many webmasters have access to their own Apache? How many know enough C to make Apache behave in that way?
Old 2006-12-21, 01:49 AM   #10
NobleSavage
Lord help me, I'm just not that bright
 
Join Date: May 2006
Posts: 103
Quote:
Originally Posted by cash29 View Post
If he can hack on Apache's source code, then yes. Although, this seems somewhat unlikely: how many webmasters have access to their own Apache? How many know enough C to make Apache behave in that way?
Couldn't you just use:

<?php
// $date_string must be an HTTP-date, e.g. gmdate('D, d M Y H:i:s') . ' GMT'
header('Last-Modified: ' . $date_string);
?>

Old 2006-12-21, 03:41 AM   #11
NobleSavage
Lord help me, I'm just not that bright
 
Join Date: May 2006
Posts: 103
Quote:
Originally Posted by cash29 View Post
There is a Python-based script that does just what you're asking for (obviously, not in C):
http://www.planetplanet.org/

But if you know some C, you can always optimize the parts that need it.

Hey Cash,

That looks like a nice script. I'll have to give it a shot. And python is just cool. Thanks!
Old 2006-12-21, 04:39 AM   #12
twn
Shut up brain, or I'll stab you with a Q-tip!
 
Join Date: Dec 2005
Posts: 118
My little aggregator is 2 years old:
http://www.sexblogdemon.com/

Its most important part is the intelligent Java spider/bot. It has detection for RSS-fed blogs and anti-spam. It is now handling over 10,000 feeds and it's still hosted on the thumblogger server; imagine that. I guess it can handle over 100,000 feeds with ease.

I must admit it seems to be a bit of overkill to make such a big app, but I guess I had a lot of time back in the day.
__________________

* Blog Submitter * Free WordPress
Old 2006-12-21, 07:30 AM   #13
Useless
Certified Nice Person
 
Join Date: Oct 2003
Location: Dirty Undies, NY
Posts: 11,268
Quote:
Originally Posted by cash29 View Post
No, you CANNOT look at the post dates to determine uniqueness of posts. All I would have to do to spam your aggregator would be to update the date of my post without changing its content.
Then you drop their feed. Done! One can't expect to let a site run itself and end up with quality. And if one forces an aggregator to look at every post in a heavy database, one must accept that the slowdown is going to occur.

Last edited by Useless; 2006-12-21 at 07:34 AM..
Old 2006-12-21, 09:43 AM   #14
cash29
Rock stars ... is there anything they don't know?
 
Join Date: Dec 2006
Posts: 14
Quote:
Originally Posted by NobleSavage View Post
Couldn't you just use:

<?php
// $date_string must be an HTTP-date, e.g. gmdate('D, d M Y H:i:s') . ' GMT'
header('Last-Modified: ' . $date_string);
?>

PHP can do that? Oops, my bad. Just goes to show how much of it I really know.
Old 2006-12-21, 09:57 AM   #15
cash29
Rock stars ... is there anything they don't know?
 
Join Date: Dec 2006
Posts: 14
Quote:
Originally Posted by Useless Warrior View Post
Then you drop their feed. Done! One can't expect to let a site run itself and end up with quality. And if one forces an aggregator to look at every post in a heavy database, one must accept that the slowdown is going to occur.
Good point. The best solution to spam isn't a technical one; just don't allow spammy blogs on your aggregator.
After this discussion, I'm thinking:
* It makes more and more sense to just keep it at 50 items or so per feed at any given time. I think this should be enough to keep visitors busy and should keep your script running not too badly.

* Writing your own aggregator is the way to go. Most of the ones I've used were too immature and didn't scale well. Gregarius and Lilina (PHP-based) would both start to choke at around 15,000 feed items (I read a lot of tech blogs).
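The per-feed cap from the first point can be sketched with a bounded queue. This is an in-memory illustration only: `FeedStore` and its methods are hypothetical names, and the default of 50 simply mirrors the figure floated above.

```python
from collections import deque


class FeedStore:
    """Keep only the newest `max_items` entries for each feed."""

    def __init__(self, max_items=50):
        self.max_items = max_items
        self._feeds = {}  # feed_url -> deque of entries, oldest first

    def add(self, feed_url, entry):
        """Append an entry; a full deque silently drops its oldest one."""
        items = self._feeds.setdefault(feed_url, deque(maxlen=self.max_items))
        items.append(entry)

    def latest(self, feed_url):
        """Return the stored entries for a feed, newest first."""
        return list(reversed(self._feeds.get(feed_url, ())))
```

In a database-backed aggregator the same effect would come from deleting each feed's rows beyond the newest 50 after every update, which keeps the table size proportional to the number of feeds rather than to the site's age.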
Old 2006-12-21, 11:33 AM   #16
T Pat
You can now put whatever you want in this space :)
 
Join Date: Aug 2003
Location: Paridise
Posts: 3,244
Wow, lots to think about. Thanx guys, keep 'em coming.
Old 2006-12-21, 02:35 PM   #17
T Pat
You can now put whatever you want in this space :)
 
Join Date: Aug 2003
Location: Paridise
Posts: 3,244
Sparky, my hero, did some of his voodoo shit last night and gave me a surprise Christmas present. He improved the hell out of the Newstopica script; it's smokin' fast now.