Any site ripper would work.

If you're comfortable with wget, you could do something like:

wget -r -l 99 http://site.com/

You might also try adding -k (--convert-links), which rewrites the links in the downloaded pages so relative URLs resolve for local viewing. (-F/--force-html only applies when reading URLs from an input file with -i, where pairing it with -B/--base=URL acts as if a <base href> were present.)

Adding --limit-rate=300k would cap the fetch at 300 KB/second. If you do one site at a time, depending on the page composition, that might be enough to keep from killing the server while it spiders everything.
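
Putting those flags together, a sketch of the full invocation (site.com and the 300k cap are placeholders to adjust for your case):

# mirror recursively to depth 99, rewrite links for local viewing, cap bandwidth
wget -r -l 99 -k --limit-rate=300k http://site.com/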

It will save everything in a directory structure mirroring the site, which you can shuffle around afterward. A page that nothing links to won't get spidered this way, but you can supply an input file of URLs with -i, so if you have a sitemap, you could parse it to pull the pages that aren't internally crosslinked. With WordPress, I don't think this is a problem.
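
As a rough sketch of the sitemap route, assuming a standard sitemap.xml with one <loc> entry per line (the sed pattern is a quick-and-dirty parse, not a real XML parser):

# extract the URL list from the sitemap
wget -qO- http://site.com/sitemap.xml | sed -n 's:.*<loc>\(.*\)</loc>.*:\1:p' > urls.txt
# fetch each URL, recreating the directory structure, with the same rate cap
wget -i urls.txt -x --limit-rate=300k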