Greenguy's Board - View Single Post

cd34 · 2006-04-12, 12:32 PM

It is my belief that extremely broken HTML will cause extreme problems.

What the browser does and how a Google bot reads a page are totally different. Google uses a lot of Python, so we'll assume that they use Python for their bot. Python has an SGML parser (sgmllib) that walks your page tag by tag, which lets the bot dissect the page into a tree structure and then go to work on that.
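To give a rough idea of what "dissecting into a tree" looks like, here is a minimal sketch in Python 3. It uses html.parser from the standard library as a stand-in, since sgmllib was dropped in Python 3, and it is only an illustration, not how any real bot works. OutlineBuilder, outline and the sample page are made-up names for this example, and it assumes every tag on the page is properly closed.

```python
from html.parser import HTMLParser


class OutlineBuilder(HTMLParser):
    """Record each element with its nesting depth (a flattened tree)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        # A new element starts one level below whatever is currently open.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Trusts the page: assumes this end tag closes the innermost element.
        self.depth -= 1


def outline(html):
    builder = OutlineBuilder()
    builder.feed(html)
    builder.close()
    return builder.outline


page = ('<html><body><h1>Title</h1>'
        '<p><a href="page.html">hi there</a></p></body></html>')

for depth, tag in outline(page):
    print("  " * depth + tag)
```

Note that the depth counter simply trusts the end tags. That is fine on clean markup, but it is exactly the kind of bookkeeping that goes wrong once tags are left open or closed in the wrong order.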

Things like

Code:

<a href="page.html>hi there</a>

will sometimes render properly in browsers. Hanging table cells, improperly nested cells, etc. all cause problems for an automated process. It used to be that you could get away without closing <td>, <tr>, <b>, etc.; however, unclosed tags like those can break automatic parsing.
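A missing quote like the one in the example above is easy to catch before a spider ever sees it. Here is a deliberately naive sketch in Python 3; tags_with_unbalanced_quotes is a made-up helper, not part of any library. It assumes tags end at the first ">" and only looks at double quotes, so a ">" inside a quoted attribute would fool it.

```python
def tags_with_unbalanced_quotes(html):
    """Return the raw text of every tag holding an odd number of double quotes."""
    suspects = []
    pos = 0
    while True:
        start = html.find("<", pos)
        if start == -1:
            break
        end = html.find(">", start)
        if end == -1:
            break
        tag_text = html[start:end + 1]
        # An odd count means a quote was opened and never closed.
        if tag_text.count('"') % 2 == 1:
            suspects.append(tag_text)
        pos = end + 1
    return suspects


broken = '<a href="page.html>hi there</a>'
fixed = '<a href="page.html">hi there</a>'

print(tags_with_unbalanced_quotes(broken))  # flags the opening <a> tag
print(tags_with_unbalanced_quotes(fixed))   # nothing to report
```

A real validator does far more than this, but even a check this crude catches the kind of typo that browsers quietly paper over.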

While I don't follow every recommendation that the validators give, I do make sure that the HTML isn't broken. I have pages with technically incorrect HTML, such as bgimage, bgcolor, etc. used as attributes on elements where the standard doesn't allow them, but I'll let that slide. That won't break a parser.

However, improperly nested content can sometimes cause problems.

Code:

<a href="page.html"><h1>hi there</a></h1>

An automated process can get confused by the above. Depending on how they are parsing, I suspect you might lose the effect of the <h1>. Now, Google probably goes to great lengths to make sure they can spider the web to the best of their ability, but why gamble on that?
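To show why the nesting example trips up an automated process, here is a small sketch in Python 3 that tracks open tags on a stack, again using html.parser as a stand-in. NestingChecker and check_nesting are made-up names, and the recovery rule (close everything down to the matching tag) is just one assumed strategy. Nobody outside Google knows what their parser really does.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Track open tags on a stack and note end tags that arrive out of order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.problems = []  # (end tag seen, innermost open tag or None)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            return
        expected = self.open_tags[-1] if self.open_tags else None
        self.problems.append((tag, expected))
        # Assumed recovery: implicitly close everything down to the match.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def check_nesting(html):
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems


bad = '<a href="page.html"><h1>hi there</a></h1>'
good = '<a href="page.html"><h1>hi there</h1></a>'

print(check_nesting(bad))   # [('a', 'h1'), ('h1', None)]
print(check_nesting(good))  # []
```

With this recovery rule, the </a> forces the <h1> closed early and the later </h1> matches nothing at all. A different parser may recover differently, which is the whole gamble.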
