A lesson of RegExp: 50x faster with just one line patch

While I'm developing WebSHi (which is the fastest syntax highlighter written by JavaScript), I also write many performance testings for other rivals. One of them is SyCODE Syntax Highlighter, which is written by silverdrag (水月). It derives from the famous SyntaxHighlighter 1.5.x (dp.SH for short) and as silverdrag's words, it should be 5x to 10x faster than original dp.SH.

But unfortunately, my testings can't prove it. Though it won't trigger the "script slowly" dialog like dp.SH when highlighting large file, in most cases, it only shows 2x faster than dp.SH on IE6. On the other side, when I tested it on FF, I was so surprised that SyCODE is extremely slow, it will cost 5s+ for processing a 700 lines JavaScript file while the original dp.SH only half second.

It's very strange, so I digged into SyCODE. I found that SyCODE highlights more words for JavaScript language (currently all my testcases are to highlight some JavaScript source code files). The original dp.SH (and most other rivals) only highlights the keywords of JavaScript language. SyCODE also highlights global names like Array, Boolean, String, etc., and properties and methods like alert, charAt, onclick etc.. That means SyCODE need to do more text searching and replacement. I disabled such features and tested again, this time SyCODE is 2x faster than dp.SH.

So you will think the problem is just the extra words replacement. And what interesting is it just affect FF a lot, even SyCODE do more text processing, it's still faster than (or at least as fast as) dp.SH on other browsers (Safari, Chrome, Opera and IE).

I'm curious about the root cause of the problem. After some researching, I located it. Just one simple function:

GetKeywords: function(str) {
 return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';

The function GetKeywords is used to generate a regexp for keywords search and replacement. For example, GetKeywords("abstract break byte case catch") will return a regexp /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/.

The code is straightforward, but it's bad and generate a very inefficient regexp.

The keypoint is \b, \b is a word boundary assertion. To test whether a position is a word boundary, the regexp engine need to consider both the left character of the position and the right character of the position. If one is a word character (aka a-z, A-Z, 0-9 and the underscore "_") and the other is not, then it's a word boundary. You see it need both look forward one char and look backward one char. Though \b assertion is not very expensive, each failed match of /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/ will do such look forward/backward 10 times, and JavaScript language has 50+ keywords means each failed match will do 100 times, and SyCODE add 400+ properties/methods words means extra 800+ times!

Of coz, \b assertion can be easily optimized, but our test result shows that FF's regexp engine doesn't do a good optimization at all.

Anyway, there is a very cheap way to solve the problem. Most of those \b assertions are unnecessary. /\babstract\b|\bbreak\b|\bbyte\b|\bcase\b|\bcatch\b/ can be rewrite as /\b(?:abstract|break|byte|case|catch)\b/, those two regexp are equal, the only difference is the latter only need two \b assertion. Yes, we just need two, even SyCODE add 400+ words, we still just need two.

It's trivial to fix GetKeywords:

GetKeywords: function(str) {
 return '\\b' + str.replace(/\s+/g, '\\b|\\b') + '\\b';
 return '\\b(' + str.replace(/\s+/g, '|') + ')\\b';

Let's see the result of applying this one line patch:

Test results of FF3
code linesoriginalpatched

Oops, one line code cause 50x difference.

Besides FF, the patch also help other browsers a lot.

Test results of 4300 lines of code

As we can see, even Chrome, which introduce a very optimized regexp engine, also shows at least 4x difference.

SyCODE derives from dp.SH, the GetKeywords function is also the legacy from dp.SH, and even the new SyntaxHighlighter 2 still use the similar code. Because dp.SH only highlight about 50 keywords for JavaScript langauge, you will not see performance issue like SyCODE, but applying this one line patch still introduce 20% faster on most browsers.

But this patch is not the end. In next article, I will discuss a complex technique to get another 10% to 40% faster for keywords search/replacement.


My blog was blocked

My blog (blog.csdn.net/hax) was blocked in June.


* hax * 海曦
发表于:2008-06-26 00:53:02 楼主




* PrideRock * 荣耀石(PrideRoCK)
发表于:2008-06-27 16:27:521楼 得分:0


* vcleaner * 我没当大哥很久了.......
发表于:2008-06-28 10:09:432楼 得分:0


* hax * 海曦
发表于:2008-06-29 13:42:344楼 得分:0

不就是一篇讲地震的文章么——内容就不说了,实际上我写的是一个辟谣贴,所以这个网监处是脑残,谣言贴和辟谣贴都分不清。虽然这不干CSDN的事情,但是实际这篇文章五月二十几日就删除了,为什么到了六月又封呢?封了又不给说明。所谓网监处曾经因为Sourceforge上有一个freenet项目就把有几万个项目的SF整个封锁了,还曾因为freebsd的名字里有个Free,就把freebsd.org给封了,更有甚至曾经封了Google,现在还在封 wikipedia……脑残度放到世界五千年历史上也是屈指可数。而你们因为我Blog一篇已经被删除的所谓的“非法文章”(非TMD法,有种网监处向法院提起公诉呀,老子行不更名坐不改姓,姓贺,贺龙的贺,名师俊,为人师表的师,英俊少年的俊,曾在某副部级单位的党委宣传部任职,自1997年起网名唤作 hax的就是我了)封了我有数百篇技术文章的blog——那你们不是跟网监处一样脑残了么。


E3*E3 E3+E3+3

E3*E3 E3+E3+3 annual sacrifice


My Girl

Today, 2008-01-07T00:40+0800, She said she will follow me if I go to Beijing. I am moved deeply.


Comment on Forward/Backward keys of Thinkpad

This is my comment on Internet Browser Keys Poll of Lenovo design blog and you can read Back and Forward Again for more background information.


The layout of the forward/backward key is totally wrong. Forward/Backward keys of Thinkpad Just like the previous post shown, it's always inflame me. The keys(arrow keys and f/b keys) are so small and every time I move cursor when I enter some thing may cause my post disappeared. It can be even worse for the Chinese users. Most IME use shift as a switch of english/chinese mode, and press right shift will also accident touch the fucking backward key! I hate them at all!

And it's too silly to force the users disable them in the keyboard customizer, in fact, I only know them when I read this post, and 90% users will never know how to disable this damn feature!

Finally I'm so disappointed that who visit this site are mostly the fans of thinkpad and tend to conservative.

I would say my opinion loudly: it's not about the usefulness of the browser keys, it's all about the silly layout and design of those keys! Who design that should eat their own dogshit!

Chinese Kungfu Emperor KO Google Blogger

It's near 3 month since my last update. Some of my laziness, but the most important reason is Blogger has been knocked out by Chinese Kungfu Emperor (my translation from "功夫王", which is a nickname of GFW, another nickname of it is Mitten Crab King -- "河蟹王", we invent these names because "GFW" may be GFWed too --- for example, search "GFW" in google.cn only returns 48 results, but search in google.com returns 4,110,000 results).

It's not only a shame but also a pride that GFW can do what seaquake can't do. I am shocked when I finally realize that.

But there is also some good news, wikipedia can be accessed from China now, but we don't know what happened, we don't know what will happen. Someone believe they the Chinese network 007s are adjusting the configuration or testing some new weapons. Anyhow, that give me a naive hope that blogger will be back in some day. Yes, that's impossible isn't it? Currently, inblogs.net provide a alternative entry to my blog, though I still need tor to login blogger and post new articles.



Refactory of feedbuner chinese version

Note: This article is written in Chinese.


More choices are coming soon. Where did Chinese go? We heard your feedback and we're taking it back to the shop for more work. Thanks for having your say!