Greenguy's Board

http://www.greenguysboard.com/board/showthread.php?t=23770

SortLinks 2005-09-07 11:24 AM

Fast 404 error checker for Link Lists
 
Does anyone know where to get a fast 404 error checker for link lists?
It must check links for 404s and redirects.
It should be a PHP one so I can run it from cron :)

The one I have fails from time to time.

SortLinks 2005-09-08 01:58 PM

Please guys, I know you're hiding something :)

gc 2005-09-08 06:39 PM

Do you have the links to be checked in a dbase or does it just have to check from the site?

Regards, Thomas

SortLinks 2005-09-09 01:53 AM

It doesn't matter. I think checking from the db will be faster, but I'd be happy with a bot too.

Mr. Stiff 2005-09-30 11:58 AM

I don't know of any PHP script that will do this 'out-of-the-box'.

Do you have any coding experience?
If so, check: http://www.php.net/manual/nl/ref.curl.php

If not, feel free to contact me ;)

Joneze 2005-10-01 04:16 AM

You might want to look here

http://www.hotscripts.com/PHP/Script...ing/index.html

SortLinks 2005-10-01 12:43 PM

thanks so much guys!
Mr. Stiff, I can write PHP, but I'm confused about what to do with curl. How can it help me out?
To Joneze: thanks for the link!

Mr. Stiff 2005-10-02 04:24 AM

Hi,

Curl is a good program for getting webpages, headers, etc. It's installed on most (good) hosting servers.

Here's how I use it:

- Column 'lastspider' on my gallery table
- Query the table, getting URLs not spidered in the last xxx days/hours/weeks/whatever
- Use the curl extension to connect to each URL
- You can choose to download only the headers, which is much faster than downloading the full page
- Check the header response (must be 200; 404 -> page not found, 301 or 302 -> redirect)
- Update your table!
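
The steps above could be sketched roughly as follows. This is only an illustration, not Mr. Stiff's actual script: the row layout, the 'state' field, and the function names classify_status and spider_stale_rows are assumptions, and the status lookup is passed in as a callable so the loop can be tried without touching the network or a real database.

```php
<?php
// Hypothetical sketch of the 'lastspider' workflow; names and row layout are assumptions.

/** Map an HTTP status code to a coarse link state. */
function classify_status(int $code): string
{
    if ($code === 200) {
        return 'ok';
    }
    if ($code === 301 || $code === 302) {
        return 'redirect';
    }
    if ($code === 404) {
        return 'not_found';
    }
    return 'other'; // anything else, including 0 when no response was received
}

/**
 * Check every row whose 'lastspider' timestamp is at least $maxAge seconds old.
 * $fetchStatus is any callable that takes a URL and returns an HTTP status code,
 * for example a small wrapper around the curl extension.
 */
function spider_stale_rows(array $rows, callable $fetchStatus, int $now, int $maxAge): array
{
    foreach ($rows as &$row) {
        if ($now - $row['lastspider'] < $maxAge) {
            continue; // spidered recently enough, skip it
        }
        $row['state'] = classify_status($fetchStatus($row['url']));
        $row['lastspider'] = $now;
    }
    unset($row); // break the reference left behind by foreach
    return $rows;
}
```

In a real cron job the rows would come from a SELECT on the gallery table and the returned array would be written back with an UPDATE; both are left out here because the thread does not show the schema.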

venturi 2005-10-03 11:58 AM

Quote:

Originally Posted by Mr. Stiff
Hi,

Curl is a good program for getting webpages, headers, etc. It's installed on most (good) hosting servers.

Here's how I use it:

- Column 'lastspider' on my gallery table
- Query the table, getting URLs not spidered in the last xxx days/hours/weeks/whatever
- Use the curl extension to connect to each URL
- You can choose to download only the headers, which is much faster than downloading the full page
- Check the header response (must be 200; 404 -> page not found, 301 or 302 -> redirect)
- Update your table!

You'll want to be a bit more discerning than just ensuring you get a "200 OK" response. If you'd like a script coded up let me know - however most of the better LL scripts already have checkers built into them.
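
venturi does not say what "more discerning" means here. One possible reading, shown purely as an assumption, is to avoid dropping a link on a single bad response and to require several consecutive failures first. The function names, the threshold of 3, and the use of 0 for "no response" are all hypothetical.

```php
<?php
// Assumption: $history holds the most recent HTTP codes for one link, oldest first;
// 0 stands for "no response" (timeout, DNS failure, connection refused).

/** Treat "no response" and any 4xx/5xx code as a failed check. */
function is_failure(int $code): bool
{
    return $code === 0 || $code >= 400;
}

/** True only when the last $threshold checks all failed. */
function should_drop_link(array $history, int $threshold = 3): bool
{
    if (count($history) < $threshold) {
        return false; // not enough evidence yet
    }
    foreach (array_slice($history, -$threshold) as $code) {
        if (!is_failure($code)) {
            return false; // at least one recent check succeeded
        }
    }
    return true;
}
```

With this rule a link that returned 404 three times in a row would be dropped, while one that answered 200 on the latest check would be kept.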

SortLinks 2005-10-03 02:03 PM

I have a checker. It does not use curl and it fails me, giving invalid results most of the time.
I don't like scripts that are Zend-encoded, since
I like to optimise scripts myself, making them unique.
Thanks guys, this thread should be useful for those who don't know about it.

SortLinks 2005-10-07 08:15 AM

Mr. Stiff, it's a good idea. "Researching" curl right now:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.sortlinks.com");
curl_setopt($ch, CURLOPT_HEADER, 0);          // 0 = leave response headers out of the output
curl_setopt($ch, CURLOPT_REFERER, $host);     // $host must be defined beforehand
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the response as a string
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)");
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_0);
$t = curl_exec($ch);
echo $t;

I was only able to get the full page, like PHP's file() does.
Could you please let me know how to get only the headers and the so-called spider response?

gc 2005-10-10 06:51 PM

code = curl_easy_setopt(http_headconn, CURLOPT_NOBODY, 1);

SortLinks 2005-10-11 11:24 AM

Fatal error: Call to undefined function: curl_easy_setopt()

oast 2005-10-14 12:45 PM

On my domains I block all known offline browsers, email harvesters, download managers, etc.
Curl is one of those that I block... because I don't want anyone 'mirroring' my content.

I've tried using scripts to clean out the 404s and redirects, but nothing is 100% accurate. Even manual checking isn't perfect, as you could check at the time the server is going thru a reset for whatever reason.

You should use (and trust) whichever you find the most satisfactory for you... or a combination of 2 or 3.

Just my 2c worth.

SortLinks 2005-10-14 07:33 PM

oast, how do you manage to distinguish good bots from bad ones?

oast 2005-10-14 08:13 PM

Thru the User Agent string that (nearly) all programs use to identify themselves.

I use htaccess then to forbid (or redirect) the 'bad boys'.

oast 2005-10-14 08:30 PM

A small extract from the filtering lines of my .htaccess file looks like this:

Quote:

RewriteEngine On

# Blank (or "-") Referer *and* User Agent
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* - [F]
# Address harvesters
RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|ExtractorPro) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^E?Mail.?(Collect|Harvest|Magnet|Reaper|Siphon|Sweeper|Wolf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DTS.?Agent|Email.?Extrac) [NC]
RewriteRule .* - [F]

When 'Code Red' was more prevalent than it is today, I redirected all requests to Micro$oft ;)

mod_rewrite can be a very powerful tool if used correctly.

AFAIK Apache is the only server it is available on, but as a large number of hosting companies prefer Apache, you should be OK
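
To experiment with which User Agent strings the quoted rules would catch, the same patterns can be mirrored in a few lines of PHP. This is only a testing aid under stated assumptions: the function name is_blocked is hypothetical, and the real filtering still happens in Apache, not in PHP.

```php
<?php
// Mirrors the quoted .htaccess extract for experimentation only.

function is_blocked(string $userAgent, string $referer = ''): bool
{
    // Blank (or "-") Referer *and* User Agent
    if (preg_match('/^-?$/', $referer) && preg_match('/^-?$/', $userAgent)) {
        return true;
    }
    // Address harvesters; the "i" flag plays the role of [NC]
    $patterns = [
        '/^(autoemailspider|ExtractorPro)/i',
        '/^E?Mail.?(Collect|Harvest|Magnet|Reaper|Siphon|Sweeper|Wolf)/i',
        '/(DTS.?Agent|Email.?Extrac)/i',
    ];
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $userAgent)) {
            return true;
        }
    }
    return false;
}
```

Note that the extract shown contains no rule for curl itself, so a curl User Agent passes this sketch; oast's full file presumably has more lines than the ones quoted.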

oast 2005-10-14 08:41 PM

Quote:

Originally Posted by oast
When 'Code Red' was more prevalent than it is today, I redirected all requests to Micro$oft

Should I have said that?

Honestly Bill, I was naive at the time. I don't do things like that any more |blowkiss|

Mr. Stiff 2005-10-20 02:59 AM

Quote:

Originally Posted by SortLinks
Fatal error: Call to undefined function: curl_easy_setopt()

That should be just curl_setopt($ch, OPTION, VALUE), e.g. curl_setopt($ch, CURLOPT_NOBODY, 1). curl_easy_setopt() is the C library's function, not PHP's.

Definitely keep the line 'curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)");'. This will fool those webmasters checking for Curl ;)
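
Putting the correction together, a header-only check in PHP might look like the sketch below. The function name fetch_status, the 10-second timeout, and the shortened User Agent string are assumptions; the curl options and curl_getinfo() are the standard PHP curl extension calls.

```php
<?php
/**
 * Return the HTTP status code for $url using a header-only request,
 * or 0 if no response arrived (timeout, DNS failure, connection refused).
 */
function fetch_status(string $url, int $timeout = 10): int
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // send a HEAD request, skip the body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // do not echo anything
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // report 301/302 instead of following them
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
    curl_exec($ch);
    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}
```

Some servers answer HEAD requests differently from GET requests, so a link that fails this check may deserve a second look with a normal request before it is removed.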

SortLinks 2005-10-20 03:27 AM

Thanks oast and Mr. Stiff - good job :)


All times are GMT -4.