The number of differences between the HTML and XHTML formats may be greater than it looks at
first sight.
Even the W3C and the WHATWG tend to downplay the differences, suggesting that it's really the same language,
with only minor deviations in syntax, but if you look deeper, it becomes obvious how many dissimilarities there
really are.
Now I haven't seen many documents that describe all the differences in one place, so here goes.
Feature | HTML | XHTML |
---|---|---|
MIME type | Should be text/html, but many others accepted by browsers | Either application/xhtml+xml or application/xml |
| | Content-type: text/html | Content-type: application/xhtml+xml |
XML declaration | None (many browsers ignore it if present) | Must be present if the encoding is not "UTF-8" (although many browsers ignore it) |
| | | <?xml version="1.0" encoding="UTF-8"?> |
New in XHTML: processing instructions | (not possible in HTML) | E.g. to associate a stylesheet |
| | | <?xml-stylesheet href="sheet.css" ?> |
Doctype declaration | Scanned to determine if quirks or standards mode should be used | Actually parsed according to the markup declaration (MD) rules |
| | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" |
Additional data in doctype declaration | Ignored | Parsed, and the data is used |
| | <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" |
Namespace | Namespace? Never heard of it | XHTML namespace must be declared in the <html> start tag |
| | <html> | <html xmlns="http://www.w3.org/1999/xhtml"> |
Optional start tags | html, head, body, colgroup, tbody (but it's recommended to use them anyway) | Start tags are always required (but see note about tbody) |
| | | <head> |
Void elements | area, base, basefont, bgsound, br, col, command, embed, frame, hr, img, input, isindex, keygen, link, menuitem, meta, param, source, spacer, track, wbr | There are no void elements in XHTML. All elements must be closed. |
| | <link rel="stylesheet" href="style.css"> | <link rel="stylesheet" href="style.css"/> |
Selecting encoding with <meta> | If no other info is available, the browser can determine the document encoding by looking at the meta charset command. | XHTML ignores the meta charset command. |
| | <meta charset="Windows-1251"> | <meta charset="Who cares"/> |
Element & attribute names, some values | Case insensitive | Lowercase |
| | <TITLE Dir="ltr">HTML document</Title> | <title dir="ltr">XHTML document</title> |
Optional end tags | body, caption, colgroup, dd, dt, head, html, li, optgroup, option, p, rb, rt, rtc, rp, thead, tbody, tfoot, td, th, tr | End tags are always required (but see note about tbody) |
| | | </head> |
Recognition of entity references | HTML knows all entities by name and handles all references | Handles only entities which are declared through the DOCTYPE declaration |
| | This is my r&eacute;sum&eacute;. | This is my r&#233;sum&#233;. |
Handling of attribute values | Treated as text, except references are parsed | Parsed like all XHTML text |
| | <body title="An <HTML> file!"> | <body title="An &lt;XHTML&gt; file!"> |
New in XHTML: CDATA sections | Always escape < and > chars in HTML | Text with many < and > chars can be put in a CDATA section |
| | [ -[ -&lt;&lt;[-]&gt;&gt; ]&lt;&lt;[&lt;&lt;-&gt;&gt;-]&gt;&gt; ]&lt;&lt;[&lt;&lt;] | <![CDATA[ [ -[ -<<[-]>> ]<<[<<->>-]>> ]<<[<< ]]> |
New in XHTML: element minimization | Non-void elements must have their start and end tags written out in full | All elements can be shortened if empty (this is strongly discouraged though!) |
| | <script type="application/javascript" src="lib.js"></script> | <script type="application/javascript" src="lib.js"/> |
Contents of <script> and <style> | Everything treated as content up to the first end tag | Parsed like all XHTML text |
| | <script type="application/javascript"> | <script type="application/javascript"> |
Tag names in JavaScript | In JavaScript, tag names are uppercase | In JavaScript, tag names are lowercase |
| | if (document.body.children[0].tagName==='H1') ... | if (document.body.children[0].tagName==='h1') ... |
document.write in JavaScript | You can use document.write | You cannot use document.write |
| | document.write('This works'); | document.write('This does not work'); |
Attribute values | Require quotes if containing special chars, spaces etc. | Always require quotes |
| | <table border=1> | <table border="1"> |
Boolean attributes | Allowed | Disallowed. All attributes must have values |
| | <td nowrap> | <td nowrap="nowrap"> |
Structural errors | Browser tries to work around them to display the parts that don't contain errors. | Browser shows an error message. |
That's it. Make sure not to mix HTML and XHTML features in the same document; keep the two separated.
If you use one XHTML feature, like the doctype declaration, go XHTML all the way! Use the correct content-type,
don't put script content inside comments, don't have any errors, etc.
To distinguish between HTML and XHTML, the MIME type is the most important factor. If a file does not have an XML
file type, it is not XHTML, no matter what the contents of the file say. It will be handled as if it's HTML.
Note that while the official definition only has application/xhtml+xml or application/xml for MIME types, many systems use text/xml for files with extension ".xml".
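To make the rule concrete, here is a small sketch of how the choice of parser follows from the Content-Type header alone. The function name parserModeFor is made up for this example, and the list only holds the three MIME types mentioned above; real browsers recognise more XML types than these.

```javascript
// Hypothetical helper: which parser would be used for this Content-Type?
// Assumption: only the three MIME types named in the text count as XML.
const XML_TYPES = ['application/xhtml+xml', 'application/xml', 'text/xml'];

function parserModeFor(contentType) {
  // Strip parameters such as "; charset=UTF-8" and normalise the case.
  const mime = String(contentType).split(';')[0].trim().toLowerCase();
  return XML_TYPES.includes(mime) ? 'xml' : 'html';
}

console.log(parserModeFor('application/xhtml+xml; charset=UTF-8')); // xml
console.log(parserModeFor('text/html')); // html
```

Note that the contents of the file play no part in the decision, which is the whole point.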
According to the W3C standards, the XML declaration should be present, along with its encoding
attribute, in all XML files. The parser should know what encoding the file has, so it won't have to guess.
Therefore, if the encoding is not the default "UTF-8", the file must contain the proper declaration.
(However, the browsers don't really need it; they will happily go guessing at the encoding if the declaration isn't
there. And although the official definition says omitting the XML declaration is not an error, it doesn't mention
what happens if the HTTP header declares a different encoding, so it's best to play it safe.)
Also, the version should always be 1.0. Other versions will confuse some browsers.
A third parameter to the XML declaration is standalone, which should be left at its default value of "no" to keep the DTD working.
With HTML, in the absence of a document type declaration, the file is treated as an old HTML file and is rendered
in quirks mode. And ideally, with a proper doctype declaration - one of the officially recommended ones - the file is rendered in
standards compliance mode. XHTML files are always rendered in standards mode, so XHTML files don't really require
one.
However, in real life the rules are more complicated. Many browsers have more than one mode (e.g. extra
like-in-the-previous-version modes, almost-but-not-quite-entirely unlike the standard modes, etc). The W3 Validator
recognises different declarations than the W3 ones and can parse files according to those different rules. Most
browsers recognise entity references like &eacute; in XHTML mode only when they have a doctype declaration that includes a DTD. And so on.
As for XHTML5, my recommendation would be to use it only when you know what you're doing. XHTML files with an HTML5
doctype don't have those entity references, so you will get errors if you use them. However, keep reading.
In XHTML, since its doctype declaration is parsed according to the markup declaration (MD) rules, you can put extra data in it, such
as new entity names for characters. Example:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
[ <!ENTITY larrw "↜"> <!-- Leftwards wave arrow --> ]
>
so that you can write &larrw; for a leftwards wave arrow instead of having to remember the numerical
reference &#x219C;.
As an XHTML file is actually an XML file, its elements are in no namespace at all by default. (The XML namespace, http://www.w3.org/XML/1998/namespace, only covers names with the xml: prefix.) To have the elements treated as HTML, you will need to specify the XHTML namespace http://www.w3.org/1999/xhtml explicitly. Otherwise, the browsers won't know what to do with the contents of the file.
HTML files need no explicit namespace specification, because they already know what kind of files they are.
Ditto with <math> and <svg> elements embedded in the document, as well as XLink features: they require namespaces in XHTML, but can be used as is in HTML.
In HTML up to v4.01, the xmlns attribute didn't exist, so using it was an error; in HTML5 the attribute is optional (even if it does nothing).
In HTML, some start tags can be omitted if it's clear in the source where the element should begin.
For example, meta elements can appear only in the head element, so if the parser encounters <meta> directly after <html>, it's clear that the head should start just before this meta element.
Note: leaving out start and end tags does not mean that the element is not there. If it helps, think of the element
having invisible tags.
By the way, leaving them out is strongly discouraged. In fact, most non-W3C DTDs don't support this at all.
In HTML, tbody is implied. That is, tables always have at least one tbody, whether the <tbody> and </tbody> tags are present or not. Tables in XHTML can make do without tbody; they can contain trs directly. But only if there are no theads or tfoots in the table.
This difference is not very clear when simply viewing the source or the resulting page, but if you're dealing with
JavaScript, the distinction is very important. For example, take the following source:
<table><tr><td>table</td></tr></table>
In JavaScript under HTML, the first child of the table element is a tbody element. Under XHTML, the first child is
the tr.
The same is true for the colgroup element.
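If a script has to run under both parsers, one way around the implied tbody is not to rely on the first child at all. The sketch below collects the rows whether they sit directly in the table or inside a thead, tbody or tfoot. The function name tableRows is my own; it only assumes objects with tagName and children, which DOM elements have.

```javascript
// Sketch: collect the rows of a table no matter whether the parser
// inserted a tbody (HTML) or left the tr elements as direct children (XHTML).
function tableRows(table) {
  const rows = [];
  for (const child of Array.from(table.children)) {
    const name = child.tagName.toLowerCase();
    if (name === 'tr') {
      rows.push(child); // XHTML without tbody
    } else if (name === 'thead' || name === 'tbody' || name === 'tfoot') {
      for (const row of Array.from(child.children)) {
        if (row.tagName.toLowerCase() === 'tr') rows.push(row);
      }
    }
  }
  return rows;
}
```

In a browser, the built-in rows property of table elements gives a similar result, so the helper is mostly there to show where the two DOM trees differ.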
Void elements ("empty elements" in SGML lingo) are elements where the definition says they have no content and
no end tags. The element ends with the start tag. In HTML, a number of void elements are defined. XHTML has no such
thing; all elements must be closed explicitly.
To ensure compatibility with non-XHTML compliant browsers, those elements defined in HTML as void should be closed
the short way by inserting a slash, like this: <br/>. (Official sources also recommend using a space before the slash "for compatibility with some older browsers", but really, if you do want to support Netscape 4, just don't use XHTML at all.)
The slash can also be used in HTML5, where it is optional; it was an error in HTML4 and earlier.
Elements that are not defined as void, but just have no content, should be written out with a full end tag,
like this: <script src="lib.js"></script>
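For generated markup, the two rules can be folded into one small function. This is only a sketch: the name emptyElement and the mode values 'html' and 'xml' are assumptions of mine, and the set holds just the void elements that are still current, not the full historical list.

```javascript
// Void elements as defined for HTML (abridged to the current ones).
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
  'link', 'meta', 'param', 'source', 'track', 'wbr',
]);

// Hypothetical helper: write an element without content or attributes
// in a form that is safe for the given parser ('html' or 'xml').
function emptyElement(name, mode) {
  if (VOID_ELEMENTS.has(name)) {
    // <br/> for XHTML; a plain <br> is enough for HTML.
    return mode === 'xml' ? `<${name}/>` : `<${name}>`;
  }
  // Elements that merely happen to be empty get a full end tag in both formats.
  return `<${name}></${name}>`;
}

console.log(emptyElement('br', 'xml'));     // <br/>
console.log(emptyElement('script', 'xml')); // <script></script>
```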
In HTML, the meta charset command can be used to determine the encoding of the document, as long as there are no
other factors determining it (the HTTP header or the BOM). In XHTML, the meta command is ignored and only the xml
declaration is used.
Older versions of HTML could use a meta http-equiv command, in case the real HTTP header did not provide this
information:
<meta http-equiv="content-type" content="text/html; charset=Windows-1251">
but that is also ignored in XHTML. In fact, using http-equiv is considered an error in XHTML5.
In HTML, element names are case insensitive. In XML, names are case sensitive, and the choice was made to make
them lowercase. (They could have chosen uppercase, but apparently one of the designers didn't like the Shift
key.)
You should also use the lowercase versions in scripts and CSS files when referring to the XHTML.
Same with attribute names, including data-* ones.
Some attribute values, however, are case insensitive. For instance, you can specify <form method="POST"> in your XHTML file and it will work (although the W3C validator throws an error).
Not all attributes do that though, so for your own peace of mind it's better to play it safe and use lowercase for
these values. Ditto with the x in hex character references: &#xA0; is good, &#XA0; is not.
In HTML, many end tags can be omitted if it's obvious in the source where the element should end.
For example, you cannot have a td nested inside another td, so if the parser encounters <td> in a td, it's clear that the current td ends just before the new one. Similarly, if the parser encounters </tr> in a td, the current td should end first before ending the tr.
Note: some people think of the <p> tag as having no content, as just a separator between paragraphs. That is not correct; the text following the <p> tag is actually the content of the paragraph. The entire p element starts with <p> and ends with either an explicit </p> end tag, or anything that causes the paragraph to end, like another <p>, or the ending of the enclosing element, for instance, </body>.
XHTML has no rules like that: all elements must have end tags.
If the element has no content, i.e. when you would have <p></p>, you'd be able to shorten that to <p/> according to the XML rules, but you should not do that in XHTML, because of problems it can cause with not-quite-compatible parsers.
An entity is the name of a character (or group of characters). For instance, the name of the ñ character is
ntilde. Now in HTML, you can refer to the character by writing &ntilde;, but in XHTML, you can do this only if the name ntilde is defined in the DTD. That is, if the file has an older DOCTYPE declaration which contains the URL to the DTD file. With the later DOCTYPE, you can't use the entity and you'll have to use the numerical reference, &#241;, instead. XML has only five entities built in: quot, amp, apos, lt and gt.
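If you have existing HTML text full of named references, one way to make it safe for XHTML without a DTD is to turn the names into numeric references. A minimal sketch, assuming a lookup table of your own: the three entries in SAMPLE_ENTITIES are just a sample, and the function name namedToNumeric is made up for this example.

```javascript
// A tiny sample table; a real one would cover all HTML entity names.
const SAMPLE_ENTITIES = new Map([['ntilde', 241], ['eacute', 233], ['nbsp', 160]]);
// The five names that every XML parser knows.
const XML_BUILT_INS = new Set(['quot', 'amp', 'apos', 'lt', 'gt']);

function namedToNumeric(text) {
  return text.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    if (XML_BUILT_INS.has(name)) return match; // fine as it is
    const code = SAMPLE_ENTITIES.get(name);
    // Names missing from the table are left alone, so they stay visible as errors.
    return code === undefined ? match : `&#${code};`;
  });
}

console.log(namedToNumeric('Espa&ntilde;a &amp; r&eacute;sum&eacute;'));
// Espa&#241;a &amp; r&#233;sum&#233;
```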
In HTML, attributes are parsed in RCDATA mode: this is almost the same parsing mode as CDATA, except that
references are interpreted. In particular, < and > signs have no special meaning; they are used
unchanged.
In XHTML, there is no RCDATA. Everything is parsed in the same way and having a < in an attribute value would be
seen as the beginning of a tag, which causes a well-formedness error. They should be escaped.
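When attribute values come from a script or a template, the simplest approach is to escape them always, which is harmless in HTML and required in XHTML. A minimal sketch; the function name escapeAttribute is my own.

```javascript
// Escape a value for use inside a double-quoted attribute,
// in a way that both the HTML and the XML parser accept.
function escapeAttribute(value) {
  return String(value)
    .replace(/&/g, '&amp;') // must come first, or the other references get escaped again
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

const title = 'An <XHTML> file!';
console.log(`<body title="${escapeAttribute(title)}">`);
// <body title="An &lt;XHTML&gt; file!">
```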
XML has the ability to mark text as character data, unparsed text that can contain <, > and &
characters, by wrapping them in <![CDATA[ .. ]]> delimiters. This tells the parser that this is one block of text that doesn't get special treatment. You should make sure that the text itself doesn't contain ]]> though.
The special treatment only applies to parsing, not to displaying; the text isn't shown as plain text like in <pre> elements.
Also note that the sequence ]]> is reserved for ending a CDATA section; you can't have ]]> anywhere else in an XHTML file, not even in a <script> element.
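If the text is generated and you can't be sure it is free of ]]>, a common workaround is to end the section in the middle of the sequence and start a new one right away. The following is a sketch of that trick; the function name toCdataSection is an assumption of mine.

```javascript
// Wrap arbitrary text in CDATA. Every "]]>" in the text is split in two,
// so that "]]" ends one section and ">" begins the next.
function toCdataSection(text) {
  const safe = String(text).split(']]>').join(']]]]><![CDATA[>');
  return `<![CDATA[${safe}]]>`;
}

console.log(toCdataSection('if (a < b && c) ...'));
// <![CDATA[if (a < b && c) ...]]>
console.log(toCdataSection('x]]>y'));
// <![CDATA[x]]]]><![CDATA[>y]]>
```

The XML parser joins the two sections back together, so the text it hands on is x]]>y again.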
XML has no concept of void or non-void elements, so it doesn't mind how you write the tags for elements without
content. <hr/> is the same as <hr></hr>, just as <div/> is the same as <div></div>. Take care, however, that this will only work if the file is actually parsed in XML mode (i.e. with a file type of application/xhtml+xml), otherwise it will confuse the browser.
And, again, this practice is strongly discouraged.
The contents of script or style elements are treated differently. In HTML, these elements have
CDATA content; the source is scanned for the first occurrence of an end tag, and everything up to there is handed
to the script interpreter or CSS interpreter.
In XHTML, the source is treated the same way as all other element contents; that is, comments are ignored (read:
thrown away), references are converted to their corresponding characters, etc. This means that if you put the whole
chunk of script inside a comment, the script interpreter gets nothing.
There are solutions, but most of these only work in either HTML or XHTML, not in both.
In HTML, make sure the script or style sheet does not have </ anywhere in the content. The combination </ will be seen as the end tag for <script>, even if it's between quotes or inside a JavaScript/CSS comment.
Use <\/ to escape the slash.
(Older browsers tolerated </ if the whole script was commented out with <!-- .. -->, but that is no longer the case.)
In XHTML, make sure the content does not have < or & signs. Also make sure that editing the script will not introduce < or & signs later on.
If you can't avoid them, write them as &lt; and &amp; respectively.
If the content does not contain ]]>, you can put the whole content in a <![CDATA[ .. ]]> section. This may even work in HTML in some browsers, but as <![CDATA[ is not part of the HTML standard, this method is not HTML compatible.
(There are other HTML elements with a CDATA content model; some elements of which the content isn't meant to be displayed, like <noembed> and <iframe>, have it. And there are older ones such as <xmp> and <plaintext>; since these have long been deprecated, you shouldn't be using them anyway, not even in classic HTML.)
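For inline scripts that are generated by some tool, the escaping of the slash for HTML can be automated. The sketch below is deliberately naive: the function name escapeEndTagOpeners is made up, and it assumes that </ only ever occurs inside string literals, where \/ means the same as a plain slash.

```javascript
// Hypothetical helper for inline scripts in HTML: rewrite "</" as "<\/".
// Inside a JavaScript string literal "\/" is just "/", so the script
// behaves the same, but the HTML parser no longer sees an end tag.
function escapeEndTagOpeners(scriptSource) {
  return String(scriptSource).replace(/<\//g, '<\\/');
}

const source = "el.innerHTML = '<b>bold</b>';";
console.log(escapeEndTagOpeners(source));
// el.innerHTML = '<b>bold<\/b>';
```

This does nothing for the XHTML side; there the < and & signs themselves are the problem.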
In JavaScript routines, element names as fetched from the DOM tree are in uppercase for HTML
files, even if they were lowercase in the source (attribute names, by contrast, are converted to lowercase). In XHTML, all such names are cased the same way they appear in
the source (i.e. lowercase).
document.body.tagName yields BODY in HTML, but body in XHTML.
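A script that is shared between HTML and XHTML pages can sidestep the difference by comparing names without regard to case. A minimal sketch; the function name hasTagName is my own, and it works on anything with a tagName property.

```javascript
// Compare an element's name in a way that gives the same answer
// under the HTML parser (H1) and the XML parser (h1).
function hasTagName(element, name) {
  return element.tagName.toLowerCase() === name.toLowerCase();
}

// In a browser: hasTagName(document.body.children[0], 'h1')
console.log(hasTagName({ tagName: 'H1' }, 'h1')); // true
console.log(hasTagName({ tagName: 'h1' }, 'h1')); // true
```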
In XHTML, you can't use document.write(..); it will throw an exception. Note that it is strongly discouraged
in HTML too, so you should replace it in all documents anyway, even HTML ones.
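What to use instead depends on what the script was writing, but for plain text the DOM methods will do. The following is a sketch with names of my own choosing; it assumes the document is passed in as doc and that, as far as I know, createElement produces XHTML elements in documents served as application/xhtml+xml.

```javascript
// Sketch of a document.write replacement that works in HTML and XHTML.
// doc is the document object; parent is the element to append to.
function appendParagraph(doc, parent, text) {
  const p = doc.createElement('p');
  p.appendChild(doc.createTextNode(text)); // the text is never parsed as markup
  parent.appendChild(p);
  return p;
}

// In a browser: appendParagraph(document, document.body, 'This works');
```

Because the text goes in as a text node, < and & in it need no escaping in either format.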
In HTML, attribute values containing only letters, digits, hyphens or periods don't have to be enclosed in quotes.
However, there's no harm in always using quotes, so there's no need to wonder if they're needed or not. Besides,
the set of characters that can go unquoted differs from one HTML version to the next, so don't rely on
that.
In case the value contains quotes itself, use single quotes around it if it contains double quotes or vice versa.
If the value contains both single and double quotes, it's best to replace them with their character references - &#34; for a double and &#39; for a single quote.
In XHTML, and in XML in general, values must be quoted, so again, there's no need to wonder if they're needed or
not.
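The quoting rules are easy enough to put into a function. This is a sketch under my own naming (quoteAttributeValue); it only deals with the quotes, so < and & in the value still need escaping separately.

```javascript
// Pick a quoting style that works in both HTML and XHTML.
function quoteAttributeValue(value) {
  const text = String(value);
  if (!text.includes('"')) return `"${text}"`; // no double quotes inside
  if (!text.includes("'")) return `'${text}'`; // only double quotes inside
  // Both kinds present: keep double quotes outside, use a reference inside.
  return `"${text.replace(/"/g, '&#34;')}"`;
}

console.log(`<table border=${quoteAttributeValue(1)}>`); // <table border="1">
```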
Boolean attributes can be on or off. In HTML, this is indicated by having the attribute present in the start tag
or not. You don't need to put in the value; the name of the attribute is enough. They don't even have
values!
In XHTML, attributes must always have values, and it's recommended to write attribute="attribute", with the value the same as the name. Although you can give them other values such as the empty string "" (and most browsers accept this, or anything else for a value), those practices are discouraged.
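In generated markup, the two notations can come from the same function. A minimal sketch; the name booleanAttribute and the mode values 'html' and 'xml' are assumptions of mine.

```javascript
// Write a boolean attribute for the given parser, including the leading space.
function booleanAttribute(name, enabled, mode) {
  if (!enabled) return ''; // "off" means the attribute is simply absent
  return mode === 'xml' ? ` ${name}="${name}"` : ` ${name}`;
}

console.log(`<td${booleanAttribute('nowrap', true, 'html')}>`); // <td nowrap>
console.log(`<td${booleanAttribute('nowrap', true, 'xml')}>`);  // <td nowrap="nowrap">
```

The long form is also accepted by HTML parsers, so always writing nowrap="nowrap" is the simpler choice if the output has to serve both.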
With XHTML, well-formedness errors are fatal: the browser doesn't display the document, but outputs an error
message instead. In HTML, the browser works very hard to figure out what you wanted.
An example is illegal control characters such as Formfeed (ASCII 12), which XHTML cannot display.
Or the <title> element outside of the <head> where it belongs. Some browsers, upon finding a title in the body of an HTML document, will move it up into the head. With XHTML, no browser does that. However, a title in the wrong place is not a well-formedness error, just a validity error, so the browsers won't mind; they will display the document anyway!
Other similar validity errors also get different treatment in HTML and XHTML. E.g. many HTML parsers remove the tags from caption elements if they occur outside of tables, leaving only the contents. XHTML parsers can't do that.
And HTML forgives you if you use a value attribute on a time element instead of the datetime attribute. XHTML does not.