I did stress test it, and spent some time optimizing.
I don't have detailed benchmarks handy, but by way of example, parsing the Google home page (once the HTML is loaded into a variable) takes roughly 120ms on my dev box--a modest VM running SQL Server 2008. (Averaged over a large number of iterations.)
Anecdotally, I noticed that parse time is similar to (perhaps a little faster than) that of http://lint.brihten.com/html/
I was able to cut parse time to about a third of what it was in my initial version.
I learned something very interesting: concatenating long strings is a performance killer. Concatenating short strings is fine. In my testing, a VARCHAR(MAX) that holds fewer than 8000 characters is treated as a short string--with fast concatenation. Once the string grows to 8000 characters, concatenation becomes painfully slow.
So to optimize, wherever concatenation is needed I try to work only with short strings. When the accumulator approaches 8000 characters, I concatenate that chunk of data onto a larger string and then clear out the accumulator variable. In this way I minimize the number of long concatenations.
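
To illustrate the idea, here is a minimal T-SQL sketch of that chunked accumulation. It is not the parser's actual code: the variable names (@result, @chunk, @piece), the sample data, and the loop count are made up for the example, and the only thing it takes from the description above is the 8000-character threshold.

```sql
-- Sketch only: all names and sample values here are hypothetical.
DECLARE @result VARCHAR(MAX) = '';  -- long string, appended to as rarely as possible
DECLARE @chunk  VARCHAR(MAX) = '';  -- short accumulator, kept under 8000 characters
DECLARE @piece  VARCHAR(100);
DECLARE @i      INT = 1;

WHILE @i <= 50000
BEGIN
    -- Stand-in for whatever small fragment the real code produces.
    SET @piece = '<td>' + CAST(@i AS VARCHAR(10)) + '</td>';

    -- Flush before the accumulator would reach 8000 characters.
    -- DATALENGTH returns bytes, which equals characters for VARCHAR
    -- in a single-byte collation (assumed here).
    IF DATALENGTH(@chunk) + DATALENGTH(@piece) >= 8000
    BEGIN
        SET @result = @result + @chunk;  -- the only long concatenation
        SET @chunk  = '';
    END;

    SET @chunk = @chunk + @piece;        -- short concatenation
    SET @i = @i + 1;
END;

SET @result = @result + @chunk;          -- flush whatever is left over
SELECT DATALENGTH(@result) AS total_length;
```

With fragments this small, the long concatenation runs only once every few hundred iterations instead of on every pass through the loop. Whether that helps as much on other versions of SQL Server is something you would need to measure.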
I know the actual parsing isn't as fast as it would be if the parser were built in C++, but performance seems solid, consistent, and sufficient for some production uses. Also, it isn't clear that a C++ implementation could parse and deliver the rows to SQL any faster: I suspect that this T-SQL implementation isn't quite as efficient at processing strings, but excels at getting the data into a resultset--meaning that overall performance may be similar to what you could achieve in a different environment.