HTML vs XHTML

The number of differences between the HTML and XHTML formats may be greater than it looks at first sight.
Even the W3C and the WHATWG tend to downplay the differences, suggesting that it's really the same language, with only minor deviations in syntax, but if you look deeper, it becomes obvious how many dissimilarities there really are.
Now I haven't seen many documents that describe all the differences in one place, so here goes.

Complete table of differences

Feature: MIME type
HTML: Should be text/html, but many others are accepted by browsers.
XHTML: Either application/xhtml+xml or application/xml.
HTML example: Content-type:text/html
XHTML example: Content-type:application/xhtml+xml

Feature: XML declaration
HTML: None (many browsers ignore one if it is present).
XHTML: Must be present if the encoding is not "UTF-8" (although many browsers ignore it).
XHTML example: <?xml version="1.0" encoding="UTF-8"?>

Feature: Processing instructions (new in XHTML)
HTML: Not possible in HTML.
XHTML: Possible, e.g. to associate a stylesheet.
XHTML example: <?xml-stylesheet href="sheet.css" ?>

Feature: Doctype declaration
HTML: Scanned to determine whether quirks or standards mode should be used.
XHTML: Actually parsed according to the MD rules.
HTML example:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
XHTML example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Feature: Additional data in doctype declaration
HTML: Ignored.
XHTML: Parsed, and the data is used.
HTML example:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd"
(anything goes, except >)
>
XHTML example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[ <!ENTITY larrw "&#8604;"> <!-- Leftwards wave arrow --> ]
>

Feature: Namespace
HTML: Namespace? Never heard of it.
XHTML: The XHTML namespace must be declared in the <html> start tag.
HTML example: <html>
XHTML example: <html xmlns="http://www.w3.org/1999/xhtml">

Feature: Optional start tags
HTML: Optional for html, head, body, colgroup, tbody (but it's recommended to use them anyway).
XHTML: Start tags are always required (but see note about tbody).
XHTML example: <head>

Feature: Void elements
HTML: base, link, meta, area, basefont, bgsound, br, col, command, embed, frame, hr, img, input, isindex, keygen, menuitem, param, source, spacer, track, wbr
XHTML: There are no void elements in XHTML. All elements must be closed.
HTML example: <link rel="stylesheet" href="style.css">
XHTML example: <link rel="stylesheet" href="style.css"/>

Feature: Selecting encoding with <meta>
HTML: If no other info is available, the browser can determine the document encoding by looking at the meta charset command.
XHTML: XHTML ignores the meta charset command.
HTML example: <meta charset="Windows-1251">
XHTML example: <meta charset="Who cares"/>

Feature: Element & attribute names, some values
HTML: Case insensitive.
XHTML: Lowercase.
HTML example: <TITLE Dir="ltr">HTML document</Title>
XHTML example: <title dir="ltr">XHTML document</title>

Feature: Optional end tags
HTML: Optional for body, caption, colgroup, dd, dt, head, html, li, optgroup, option, p, rb, rt, rtc, rp, thead, tbody, tfoot, td, th, tr.
XHTML: End tags are always required (but see note about tbody).
XHTML example: </head>

Feature: Recognition of entity references
HTML: HTML knows all entities by name and handles all references.
XHTML: Handles only entities which are in the DOCTYPE declaration.
HTML example: This is my r&eacute;sum&eacute;.
XHTML example: This is my r&#233;sum&#233;.

Feature: Handling of attribute values
HTML: Treated as text, except that references are parsed.
XHTML: Parsed like all XHTML text.
HTML example: <body title="An <HTML> file!">
XHTML example: <body title="An &lt;XHTML&gt; file!">

Feature: CDATA sections (new in XHTML)
HTML: Always escape < and > chars in HTML.
XHTML: Text with many < and > chars can be put in a CDATA section.
HTML example: [ -[ -&lt;&lt;[-]&gt;&gt; ]&lt;&lt;[&lt;&lt;- &gt;&gt;-]&gt;&gt; ]&lt;&lt;[&lt;&lt;]
XHTML example: <![CDATA[ [ -[ -<<[-]>> ]<<[<<->>-]>> ]<<[<< ]]>

Feature: Element minimization (new in XHTML)
HTML: Non-void elements must have their start and end tags written out in full.
XHTML: All elements can be shortened if empty (this is strongly discouraged though!).
HTML example: <script type="application/javascript" src="lib.js"></script>
XHTML example: <script type="application/javascript" src="lib.js"/>

Feature: Contents of <script> and <style>
HTML: Everything is treated as content up to the first end tag.
XHTML: Parsed like all XHTML text.
HTML example:
<script type="application/javascript">
if (2<3) alert('fine & dandy');
</script>
XHTML example:
<script type="application/javascript">
if (2&lt;3) alert('well-formed &amp; valid');
</script>

Feature: Tag names in JavaScript
HTML: In JavaScript, tag names are uppercase.
XHTML: In JavaScript, tag names are lowercase.
HTML example: if (document.body.children[0].tagName==='H1') ...
XHTML example: if (document.body.children[0].tagName==='h1') ...

Feature: document.write in JavaScript
HTML: You can use document.write.
XHTML: You cannot use document.write.
HTML example: document.write('This works');
XHTML example: document.write('This does not work');

Feature: Attribute values
HTML: Require quotes if containing special chars, spaces etc.
XHTML: Always require quotes.
HTML example: <table border=1>
XHTML example: <table border="1">

Feature: Boolean attributes
HTML: Allowed.
XHTML: Disallowed. All attributes must have values.
HTML example: <td nowrap>
XHTML example: <td nowrap="nowrap">

Feature: Structural errors
HTML: Browser tries to work around them to display the parts that don't contain errors.
XHTML: Browser shows an error message.

That's it. Make sure not to mix HTML and XHTML features in the same document; keep the two separated.
If you use one XHTML feature, like the doctype declaration, go XHTML all the way! Use the correct content-type, don't put script content inside comments, don't have any errors, etc.

Extra info

MIME type

To distinguish between HTML and XHTML, the MIME type is the most important factor. If a file is not served with an XML MIME type, it is not XHTML, no matter what the contents of the file say. It will be handled as if it's HTML.
Note that while the official definition only has application/xhtml+xml or application/xml for MIME types, many systems use text/xml for files with extension ".xml".
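The rule above can be sketched as a small function. This is only an illustration: parserModeFor is a hypothetical helper name, and treating every other type ending in +xml as XML is an assumption, not something a particular browser is guaranteed to do.

```javascript
// Hypothetical helper: decide which parser a document would get,
// based only on the value of its Content-Type header.
function parserModeFor(contentType) {
  // Drop parameters such as "; charset=utf-8" and normalise the case.
  const mime = String(contentType).split(';')[0].trim().toLowerCase();
  const xmlTypes = ['application/xhtml+xml', 'application/xml', 'text/xml'];
  // Assumption: any other type ending in "+xml" also goes to the XML parser.
  if (xmlTypes.includes(mime) || mime.endsWith('+xml')) {
    return 'xml';
  }
  // Without an XML type, the file is not XHTML, whatever its contents say.
  return 'html';
}
```

With this sketch, parserModeFor('application/xhtml+xml; charset=UTF-8') gives 'xml', while parserModeFor('text/html') gives 'html' even if the file itself starts with an XML declaration.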

XML declaration

According to the W3C standards, the XML declaration should be present, along with its encoding attribute, in all XML files. The parser should know what encoding the file has, so it won't have to guess. Therefore, if the encoding is not the default "UTF-8", the file must contain the proper declaration. (However, the browsers don't really need it; they will happily go guessing at the encoding if the declaration isn't there. And although the official definition says omitting the XML declaration is not an error, it doesn't mention what happens if the HTTP header declares a different encoding, so it's best to play it safe.)
Also, the version should always be 1.0. Other versions will confuse some browsers.
A third parameter to the XML declaration is standalone, which should be left at its default value of "no" to keep the DTD working.

Doctype declaration

With HTML, in the absence of a document type declaration, the file is treated as an old HTML file and is rendered in quirks mode. And ideally, with a proper doctype declaration - one of the officially recommended ones - the file is rendered in standards compliance mode. XHTML files are always rendered in standards mode, so XHTML files don't really require one.
However, in real life the rules are more complicated. Many browsers have more than one mode (e.g. extra like-in-the-previous-version modes, almost-but-not-quite-entirely unlike the standard modes, etc). The W3C Validator recognises declarations other than the W3C ones and can parse files according to those different rules. Most browsers recognise entity references like &eacute; in XHTML mode only when they have a doctype declaration that includes a DTD. And so on.
As for XHTML5, my recommendation would be to use it only when you know what you're doing. XHTML files with an HTML5 doctype don't have those entity references, so you will get errors if you use them. However, keep reading.

Extra data in the doctype declaration

In XHTML, since its doctype declaration is parsed according to the MD rules, you can put extra data in here, such as new entity names for characters. Example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[ <!ENTITY larrw "&#8604;"> <!-- Leftwards wave arrow --> ]
>

so that you can write &larrw; for a leftwards wave arrow instead of having to remember the numerical reference &#8604;.

Namespace

As an XHTML file is actually an XML file, its elements are in no namespace at all by default. (The XML namespace, http://www.w3.org/XML/1998/namespace, applies only to names with the xml: prefix, such as xml:lang.) To have the elements treated as HTML elements, you will need to specify the HTML namespace http://www.w3.org/1999/xhtml explicitly. Otherwise, the browsers won't know what to do with the contents of the file.
HTML files need no explicit namespace specification, because they already know what kind of files they are.
Ditto with <math> and <svg> elements embedded in the document, as well as XLink features: they require namespaces in XHTML, but can be used as is in HTML.
In HTML up to v4.01, the xmlns attribute didn't exist, so using it was an error; in HTML5 the attribute is optional (even if it does nothing).

Optional start tags

In HTML, some start tags can be omitted if it's clear in the source where the element should begin.
For example, meta elements can appear only in the head element, so if the parser encounters <meta> directly after <html>, it's clear that the head should start just before this meta element.
Note: leaving out start and end tags does not mean that the element is not there. If it helps, think of the element having invisible tags.
By the way, leaving them out is strongly discouraged. In fact, most non-W3C DTDs don't support this at all.

Special treatment for tbody

In HTML, tbody is implied. That is, tables always have at least one tbody, whether the <tbody> and </tbody> tags are present or not. Tables in XHTML can make do without tbody; they can contain trs directly. But only if there are no theads or tfoots in the table.
This difference is not very clear when simply viewing the source or the resulting page, but if you're dealing with JavaScript, the distinction is very important. For example, take the following source:
<table><tr><td>table</td></tr></table>
In JavaScript under HTML, the first child of the table element is a tbody element. Under XHTML, the first child is the tr.
The same is true for the colgroup element.
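A script that has to run under both HTML and XHTML should therefore not assume either tree shape. The following is a minimal sketch; firstRow is a hypothetical helper name, and it relies only on the standard children and localName properties of DOM elements.

```javascript
// Hypothetical helper: find the first row of a table, whether the tr
// sits directly in the table (possible in XHTML) or inside a row group
// such as the implied tbody (always the case in HTML).
function firstRow(table) {
  const rowGroups = ['thead', 'tbody', 'tfoot'];
  for (const child of table.children) {
    if (child.localName === 'tr') {
      return child; // tr directly inside the table
    }
    if (rowGroups.includes(child.localName)) {
      for (const row of child.children) {
        if (row.localName === 'tr') {
          return row; // tr inside a row group
        }
      }
    }
  }
  return null; // the table has no rows at all
}
```

For the example source above, firstRow returns the same tr element in both kinds of document, even though the first child of the table differs.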

Void elements

Void elements ("empty elements" in SGML lingo) are elements where the definition says they have no content and no end tags. The element ends with the start tag. In HTML, a number of void elements are defined. XHTML has no such thing; all elements must be closed explicitly.
To ensure compatibility with non-XHTML compliant browsers, those elements defined in HTML as void should be closed the short way by inserting a slash, like this: <br/>. (Official sources also recommend using a space before the slash "for compatibility with some older browsers", but really, if you do want to support Netscape 4, just don't use XHTML at all.)
The slash can also be used in HTML5, where it is optional; it was an error in HTML4 and earlier.
Elements that are not defined as void, but just have no content, should be written out with a full end tag, like this: <script src="lib.js"></script>
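For code that generates markup, the two recommendations above could be captured as follows. This is a sketch only: emptyElement is a hypothetical helper, the attribute string is assumed to be escaped and quoted already, and the set of names is copied from the void elements row in the table of differences.

```javascript
// Names of the elements that HTML defines as void.
const VOID_ELEMENTS = new Set([
  'base', 'link', 'meta', 'area', 'basefont', 'bgsound', 'br', 'col',
  'command', 'embed', 'frame', 'hr', 'img', 'input', 'isindex', 'keygen',
  'menuitem', 'param', 'source', 'spacer', 'track', 'wbr',
]);

// Hypothetical helper: write out an element that has no content.
// Void elements get the short form with a slash; all other elements
// get a full end tag, so the result is safe in HTML and XHTML alike.
function emptyElement(name, attributes = '') {
  const start = attributes ? `${name} ${attributes}` : name;
  return VOID_ELEMENTS.has(name) ? `<${start}/>` : `<${start}></${name}>`;
}
```

So emptyElement('br') produces <br/>, while emptyElement('script', 'src="lib.js"') produces <script src="lib.js"></script>.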

Meta charset

In HTML, the meta charset command can be used to determine the encoding of the document, as long as there are no other factors determining it (the HTTP header or the BOM). In XHTML, the meta command is ignored and only the XML declaration is used.
Older versions of HTML could use a meta http-equiv command, in case the real HTTP header did not provide this information
<meta http-equiv="content-type" content="text/html; charset=Windows-1251">
but that is also ignored in XHTML. In fact, using http-equiv is considered an error in XHTML5.

Element names

In HTML, element names are case insensitive. In XML, names are case sensitive, and the choice was made to make them lowercase. (They could have chosen uppercase, but apparently one of the designers didn't like the Shift key.)
You should also use the lowercase versions in scripts and CSS files when referring to the XHTML.
Same with attribute names, including data-* ones.
Some attribute values, however, are case insensitive. For instance, you can specify <form method="POST"> in your XHTML file and it will work (although the W3C validator throws an error). Not all attributes do that though, so for your own peace of mind it's better to play it safe and use lowercase for these values. Ditto with the x in hex character references. &#xA0; is good, &#XA0; is not.

Optional end tags

In HTML, many end tags can be omitted if it's obvious in the source where the element should end.
For example, you cannot have a td nested inside another td, so if the parser encounters <td> in a td, it's clear that the current td ends just before the new one. Similarly, if the parser encounters </tr> in a td, the current td should end first before ending the tr.
Note: some people think of the <p> tag as having no content, as just a separator between paragraphs. That is not correct; the text following the <p> tag is actually the content of the paragraph. The entire p element starts with <p> and ends with either an explicit </p> end tag, or anything that causes the paragraph to end, like another <p>, or the ending of the enclosing element, for instance, </body>.
XHTML has no rules like that: all elements must have end tags.
If the element has no content, i.e. when you would have <p></p>, you'd be able to shorten that to <p/> according to the XML rules, but you should not do that in XHTML, because of the problems it can cause with not-quite-compatible parsers.

Entity references

An entity is the name of a character (or group of characters). For instance, the name of the ñ character is ntilde. Now in HTML, you can refer to the character by writing &ntilde;, but in XHTML, you can do this only if the name ntilde is defined in the DTD. That is, if the file has an older DOCTYPE declaration which contains the URL to the DTD file. With the later DOCTYPE, you can't use the entity and you'll have to use the numerical reference, &#241; instead. XML has only five entities built in: quot, amp, apos, lt and gt.
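When converting existing HTML text for use in XHTML without a DTD, the named references have to become numerical ones. Below is a minimal sketch of that conversion. The function name toNumericReferences is hypothetical, and the name table holds just three entries for illustration; the real HTML list is far longer.

```javascript
// A tiny, deliberately incomplete table of entity names and code points.
const CODE_POINTS = new Map([
  ['eacute', 233], // é
  ['ntilde', 241], // ñ
  ['nbsp', 160],   // no-break space
]);

// The five entities that XML knows without any DTD.
const XML_BUILT_IN = new Set(['quot', 'amp', 'apos', 'lt', 'gt']);

// Hypothetical helper: replace named references by numerical ones.
// Built-in XML entities and names missing from the table are left alone.
function toNumericReferences(text) {
  return text.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    if (XML_BUILT_IN.has(name) || !CODE_POINTS.has(name)) {
      return match;
    }
    return `&#${CODE_POINTS.get(name)};`;
  });
}
```

For example, 'r&eacute;sum&eacute; &amp; &ntilde;' becomes 'r&#233;sum&#233; &amp; &#241;'.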

Handling of attribute values

In HTML, attributes are parsed in RCDATA mode: this is almost the same parsing mode as CDATA, except that references are interpreted. In particular, < and > signs have no special meaning; they are used unchanged.
In XHTML, there is no RCDATA. Everything is parsed in the same way and having a < in an attribute value would be seen as the beginning of a tag, which causes a well-formedness error. Such characters should be escaped: write &lt; for < and &amp; for &.
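If attribute values are produced by a script, the escaping can be done in one place. A minimal sketch, assuming the value will be written between double quotes; escapeAttribute is a hypothetical helper name, and escaping > as well is a choice made for symmetry with the example in the table of differences.

```javascript
// Hypothetical helper: make a string safe for use inside a
// double-quoted attribute value, in HTML and XHTML alike.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;') // first, so the references added below are not escaped again
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&#34;');
}
```

Here escapeAttribute('An <XHTML> file!') yields 'An &lt;XHTML&gt; file!'.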

CDATA sections

XML has the ability to mark text as character data, unparsed text that can contain <, > and & characters, by wrapping them in <![CDATA[ .. ]]> delimiters. This tells the parser that this is one block of text that doesn't get special treatment. You should make sure that the text itself doesn't contain ]]> though.
The special treatment only applies to parsing, not to displaying; the text isn't shown as plain text like in <pre> elements.
Also note that the sequence ]]> is reserved for ending a CDATA section; you can't have ]]> anywhere else in an XHTML file, not even in a <script> element.
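If the text to be wrapped might contain ]]> itself, a common workaround is to split the text over two adjacent CDATA sections at that point. The sketch below shows the idea; toCdataSection is a hypothetical helper name.

```javascript
// Hypothetical helper: wrap text in a CDATA section. Each "]]>" in the
// text is split so that "]]" ends one section and ">" starts the next;
// the parser then never sees the terminator inside the data.
function toCdataSection(text) {
  const safe = String(text).split(']]>').join(']]]]><![CDATA[>');
  return `<![CDATA[${safe}]]>`;
}
```

Text without the sequence is simply wrapped, so 'if (2<3)' becomes '<![CDATA[if (2<3)]]>', while 'a]]>b' becomes '<![CDATA[a]]]]><![CDATA[>b]]>', which an XML parser reads back as the original text.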

Element minimization

XML has no concept of void or non-void elements, so it doesn't mind how you write the tags for elements without content. <hr/> is the same as <hr></hr>, just as <div/> is the same as <div></div>. Take care, however, that this will only work if the file is actually parsed in XML mode (i.e. with a MIME type of application/xhtml+xml), otherwise it will confuse the browser.
And, again, this practice is strongly discouraged.

Script and style elements

The contents of script or style elements are treated differently. In HTML, these elements have CDATA content; the source is scanned for the first occurrence of an end tag, and everything up to there is handed to the script interpreter or CSS interpreter.
In XHTML, the source is treated the same way as all other element contents; that is, comments are ignored (read: thrown away), references are converted to their corresponding characters, etc. This means that if you put the whole chunk of script inside a comment, the script interpreter gets nothing.
There are solutions, but most of these only work in either HTML or XHTML, not in both.

(Other HTML elements have a CDATA content model too: some elements whose content isn't meant to be displayed, like <noembed> and <iframe>, and older ones such as <xmp> and <plaintext>. Since the latter have long been deprecated, you shouldn't be using them anyway, not even in classic HTML.)

Element names in JavaScript

In JavaScript routines, element names as fetched from the DOM tree (through tagName or nodeName) are in uppercase for HTML files, even if they were lowercase in the source; attribute names are reported in lowercase. In XHTML, all such names are cased the same way they appear in the source (i.e. lowercase).
document.body.tagName yields BODY in HTML, but body in XHTML.
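One way to write a test that works in both kinds of document is to compare against the standard localName property, which is lowercase for HTML elements in HTML and XHTML alike. A minimal sketch, with isTag as a hypothetical helper name:

```javascript
// Hypothetical helper: check the name of an element without depending
// on whether the document was parsed as HTML or as XHTML.
function isTag(element, name) {
  // localName does not get uppercased the way tagName does in HTML.
  return element.localName === name.toLowerCase();
}
```

A call such as isTag(document.body.children[0], 'h1') then replaces the two different tagName comparisons from the table of differences.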

document.write in JavaScript

In XHTML, you can't use document.write(..); it will throw an exception. Note that it is deprecated in HTML too, so you should replace it in all documents anyway, even HTML ones.
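A replacement that works in both HTML and XHTML is to build the content with DOM methods. The sketch below is one possible shape: appendParagraph is a hypothetical helper, and passing the document and the parent element in as parameters is a design choice made here so the function does not depend on globals.

```javascript
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Hypothetical helper: add a paragraph of text to a parent element
// using DOM methods instead of document.write.
function appendParagraph(doc, parent, text) {
  // The explicit namespace makes sure the new element is an HTML
  // element, also when the document was parsed as XML.
  const p = doc.createElementNS(XHTML_NS, 'p');
  p.textContent = text; // inserted as plain text, never parsed as markup
  parent.appendChild(p);
  return p;
}
```

In a page this would be called as appendParagraph(document, document.body, 'This works'), once the body element exists.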

Attribute values

In HTML, attribute values containing only letters, digits, hyphens or periods don't have to be enclosed in quotes. However, there's no harm in always using quotes, so there's no need to wonder if they're needed or not. Besides, some standards allow a different set of unquoted characters than the ones mentioned here, so don't rely on that.
In case the value contains quotes itself, use single quotes around it if it contains double quotes or vice versa. If the value contains both single and double quotes, it's best to replace them with their character references - &#34; for a double and &#39; for a single quote.
In XHTML, and in XML in general, values must be quoted, so again, there's no need to wonder if they're needed or not.

Boolean attributes

Boolean attributes can be on or off. In HTML, this is indicated by having the attribute present in the start tag or not. You don't need to put in the value; the name of the attribute is enough. They don't even have values!
In XHTML, attributes must always have values, and it's recommended to write attribute="attribute", with the value the same as the name. Although you can give them other values such as the empty string "" (and most browsers accept this, or anything else for a value), those practices are discouraged.

Errors in HTML

With XHTML, well-formedness errors are fatal: the browser doesn't display the document, but outputs an error message instead. In HTML, the browser works very hard to figure out what you wanted.
An example is illegal control characters such as Formfeed (ASCII 12), which XHTML cannot display.
Or the <title> element outside of the <head> where it belongs. Some browsers, upon finding a title in the body of a HTML document, will move it up into the head. With XHTML, no browser does that. However, a title in the wrong place is not a well-formedness error, just a validity error, so the browsers won't mind; they will display the document anyway!
Other similar validity errors also get different treatment in HTML and XHTML. For example, many HTML parsers remove the tags from caption elements if they occur outside of tables, leaving only the contents. XHTML parsers can't do that.
And HTML forgives you if you use a value attribute on a time element instead of the datetime attribute. XHTML does not.
