2006-09-25, 04:50 PM   #7
Halfdeck
Quote:
Out of those options the most sane and easy one is to submit to a very small high quality group of linklists that you know will list you regularly.
I agree, Mr Blue.

Quote:
btw Halfdeck, this is a purely academic debate on my part and I respect your opinions...you may be completely right and I can be wrong on it. On the surface, at least to me, the robot.txt thing seems wrong, but I do understand your points on the topic and I'd be curious to see what others think on it as well.
Same here. I don't think we're talking about right/wrong anyway. We're comparing odds.

Quote:
So, there's a certain futility in following the robot.txt advice if it prevents you from easily getting listed at the LLs you're submitting to.
In any case, robots.txt isn't the best way to keep pages out of Google's index. Matt Cutts has stated that even if you disallow URLs using robots.txt, Google may still list a disallowed URL in its index if other sites link to it, albeit URL-only (no title/description). If you want to hide a URL completely from Google's index, the commonly recommended course of action is labeling the page with a META robots noindex tag. I don't know LL scripts, but I doubt a META robots tag would interfere with their crawling. Else I dare say they should be rewritten.
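To make the difference concrete, here's the typical form of each (generic examples, not tied to any particular LL script; /mirror/ is just a placeholder path):

Code:
# robots.txt - blocks crawling, but the URL can still show up
# in the index (URL-only) if other sites link to it
User-agent: *
Disallow: /mirror/

Code:
<!-- in the page's <head> - bots can still crawl the page,
     but the URL stays out of the index entirely -->
<meta name="robots" content="noindex">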

Quote:
Essentially you are breaking a Link Lists rules because you're completely negating any benefit the Link List owner was trying to get by having category specific recips.
Only if robots.txt is one of your LL's rules.

My point, though, is that if you accept mirrored free sites, chances are you're getting linked from a supplemental page, which does you no good anyway. Also, Google seems to be getting pickier about duplicate content, especially on unknown, untrusted, one-month-old domains, so just tweaking the title/META tags and on-page text may not always be enough to keep a page in the main index.

Let me post an example.

I have a list of free sites here:

http://www.nastyxvids.com/sitemap/

Mind you, I built these free sites before I was even aware of search engines, so this isn't exactly scientific (also, site: search has been a bit quirky lately, and you may see something different from what I'm seeing depending on which DC you're hitting). The domain is a little short of two years old.

Pages listed in Google's main index:

http://www.google.com/search?q=site%...en-US:official
http://www.google.com/search?hs=6Db&...2F&btnG=Search
http://www.google.com/search?hs=7tv&...2F&btnG=Search
http://www.google.com/search?hs=duv&...2F&btnG=Search
http://www.google.com/search?hs=SaG&...2F&btnG=Search
http://www.google.com/search?hs=JGb&...2F&btnG=Search
http://www.google.com/search?hs=YbG&...2F&btnG=Search
http://www.google.com/search?hs=ewv&...2F&btnG=Search
http://www.google.com/search?hs=eHb&...2F&btnG=Search
http://www.google.com/search?hs=Hxv&...2F&btnG=Search

Most of the LLs I submitted to are getting no link love from my submissions on that domain.

------------------------------------

The way I'd go about free site mirrors now would be this:

/index.html
/main.html
/gallery1.html
/gallery2.html
/doorway1.html -> links to main.html
/doorway2.html -> links to main.html

Provided /doorway1.html is significantly different from /index.html, and assuming the hundreds of templates a submitter uses are significantly different from each other (and that the tens of thousands of submitted free sites are unique enough in terms of on-page text/HTML structure), and assuming further that the submitter builds free sites on a one-year+ old, trusted, TBPR 3+ domain and there is plenty of unique text (200-300+ words) on each page... I think all pages will be indexed as unique pages in Google, and no robots.txt disallow is needed.
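As a sketch of what I mean (hypothetical markup; the point is the one-way link into /main.html and the unique on-page text):

Code:
<!-- /doorway1.html - each doorway gets its own template,
     title, and 200-300+ words of unique text -->
<html>
<head>
<title>Unique title for this submission</title>
</head>
<body>
<p>200-300+ words of unique text here...</p>
<a href="/main.html">Enter here</a>
</body>
</html>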

Still, my main objection would be against tactics aimed at artificially boosting your SE rankings. I wouldn't assume grey hat methods like recips (they're not genuine citations or "votes," and they carry minimal traffic value) will work indefinitely.

Quote:
I don't think the link is completley ignored, but I do think it becomes highly devalued.
I don't think anything - which is why I said "chances are" - because I have no concrete evidence either way.

Quote:
Advertising's fine
Buying links for PR: bad
Google senses much
Adam Lasnik (Google's new PR guy):
http://www.webmasterworld.com/google/3079355.htm

Whether he's bluffing or not, who knows. I do know Google already detects and kills PageRank transfers on *some* bought links, and I assume the same is happening with some traded, "made for SE ranking" links.

Another relevant quote (Matt Cutts):

Quote:
After looking at the example sites, I could tell the issue in a few minutes. The sites that fit “no pages in Bigdaddy” criteria were sites where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling.
http://www.mattcutts.com/blog/indexing-timeline/

I still do not agree with the mentality of "how can I improve/optimize my ranking without getting penalized?" which seems to be driving this robots.txt discussion. A better question, imo, would be "how can I make my site more valuable to visitors, and more visible, so more people will find what they're looking for?"

Bottom line: I see nothing wrong with blocking duplicate content pages using robots.txt or a META noindex tag - that's commonly recommended SEO practice. A free site submitter doesn't gain PageRank by disallowing/noindexing a page; it only prevents duplicate content from being indexed. Tagging a free site page with NOFOLLOW would send me a different signal (a free site submitter trying to hog PageRank), but that's another issue.
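To spell out that distinction (standard META robots values, nothing LL-specific):

Code:
<meta name="robots" content="noindex">
<!-- "keep this page out of the index" - the links on it still count -->

<meta name="robots" content="nofollow">
<!-- "don't follow or credit the links on this page" -
     this is the PageRank-hogging signal -->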

P.S. Off topic, but if I ran a LL, I would think about tagging links to free sites with NOFOLLOW, as Technorati does on its tag pages, which are starting to rank very well on Google. You eliminate the reciprocal linking issue (turning all free site links into one-way links) and the possible negative trust brought on by linking to supplemental/duplicate content pages on untrusted domains.
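On the LL side that would just mean adding rel="nofollow" to each listing (example.com is a placeholder):

Code:
<a href="http://www.example.com/freesite/" rel="nofollow">Free Site Title</a>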