Regex parsing: Thompson's algorithm

Xiayun Sun

2014-05-18 22:26

This is a repost of my previous blog article at the old domain.

Source code can be found here

Theory

The central idea behind a regex engine is non-deterministic automata, NFA. While the name is scary, it is just a state machine which can transit to other state machines on certain characters or nothing (which is called an "epsilon transition")

For example, below is an NFA for the simple concatenation regex ab:

simple regex1

s0 is the start state. It can accept one character 'a' and transform to state s1. Similarly s1 can only accept one character 'b' and transfer to state s2. s2 is the final state, if we have exhausted the input string and the current state is a final state, the pattern matching is successful.

Below is a slightly more complicated example, matching a?a:

simple regex2

Here the transition with an epsilon means a state can transform into the next state without any input. Therefore this combined NFA matches both aa and a, i.e. a?a.

To build a regex engine is just to build an NFA corresponding to the input regex pattern, which can be done similarly as parsing arithmetic expressions, i.e. push individual NFAs for individual characters onto a stack, upon seeing an operator, eg star/question mark/alternation/concatenation, pop NFAs from the stack, connecting the popped NFAs to form a new NFA and push back onto the stack again.

I recommend these two excellent articles on NFA. 1. How Regexes Work 2. Regular Expression Matching Can be Simple and Fast

Thompson's algorithm

The difference between Thompson's algorithm and the current backtracking implementation in Python/Ruby/... lies in the treatment for multiple transitions. Backtracking algorithm tracks only one transition at one step (always choose the greedy transition) and backtrack to another route if current route fails, while Thompson's algorithm tracks all possible transitions simultaneously.

Take the above a?a for example, suppose we are matching an input string of a to this NFA. The backtracking algorithm will try the a transition from s0 to s1 first since it is greedy, but this route fails since input is exhausted at non-final state s1, so the algorithm backtracks to s0 and choose the epsilon transition this time, and is able to arrive at the final state s2 correctly. On the other hand, Thompson's algorithm will try to track both a and epsilon transition simultaneously. Before it reads anything, both s0 and s1 are considered as starting states since s0 can transit into s1 without any input. Now after reading the only character a, only s1 will successfully transit into final state s2, and the matching completes.

This means Thompson's algorithm will significantly perform better over some pathological input patterns such as a?a?a?aaa matching aaa, since time complexity is exponential for backtracking algorithm and only linear for Thompson's algorithm.

Code

First step is to convert an input string into post-fix tokens, and add concatenation in the right place. A simple lexer and recursive-descent parser is used for this stage, following the grammar below (for detailed lexing and parsing code please refer to the source code):

Grammar for regex:

regex = exp $

exp      = term [|] exp      {push '|'}
         | term
         |                   empty?

term     = factor term       chain {add \x08}  # \x08 is used as CONCAT token
         | factor

factor   = primary [*]       star {push '*'}
         | primary [+]       plus {push '+'}
         | primary [?]       optional {push '?'}
         | primary

primary  = \( exp \)
         | char              literal {push char}

A very simple class is enough to simulate an NFA:

class State:
    def __init__(self, name):
        self.epsilon = [] # epsilon-closure
        self.transitions = {} # a dictionary of char --> state
        self.name = name
        self.is_end = False # is it the ending state?

Next we read one token at a time, perform relevant NFA transformations and push the transformed NFA onto a stack. For example, when seeing a CONCAT token, pop two NFAs from the stack, chain these two NFAs together and create a new CONCAT NFA, and push the new NFA onto the stack again. When all the tokens are consumed, there should be only one NFA left on the stack. Details on how to construct NFAs can be found either from the two articles cited above, or from the book Compilers: Principles Techniques and Tools

A handler class is used for this step, the class has a dictionary of different handlers for each type of tokens.

class Handler:
    def __init__(self):
        self.handlers = {'CHAR':self.handle_char, 'CONCAT':self.handle_concat,
                         'ALT':self.handle_alt, 'STAR':self.handle_rep,
                         'PLUS':self.handle_rep, 'QMARK':self.handle_qmark}

These steps above form the exposed compile function, which turns an input regex string into an NFA machine.

from parse import Lexer, Parser, Token, State, NFA, Handler

def compile(p, debug = False):

    lexer = Lexer(p)
    parser = Parser(lexer)
    tokens = parser.parse()

    handler = Handler()

    nfa_stack = []

    for t in tokens:
        handler.handlers[t.name](t, nfa_stack)

    assert len(nfa_stack) == 1 # check there is only one final NFA on the stack
    return nfa_stack.pop()

Next we can feed the constructed NFA a string and check whether it is accepted or not. The match function uses two sets to hold current_state and next_state, make each char transitions and epsilon transitions and check if any ending state is arrived:

def match(self,s):
       current_states = set()
       self.addstate(self.start, current_states) # addstate makes epsilon transitions

       for c in s:
           next_states = set()
           for state in current_states:
               if c in state.transitions.keys():
                   trans_state = state.transitions[c]
                   self.addstate(trans_state, next_states)

           current_states = next_states 

       for s in current_states: #check if any ending state reached, if so, return match
           if s.is_end:
               return True
       return False

Performance

Benchmarking with Python's built-in re module, over normal inputs (from Glenn Fowler's test suite, Python performs better, which is as expected since I didn't do any optimization. But the average speed is fast for both algorithms -- less than 1ms.

performance 1

However, when tested with those pathological inputs, Thompson's algorithm is a clear win, specifically Python will hang when n is over 25. Notice the time scale is in seconds rather than in milliseconds in the previous plot.

performance 2

Russ Cox wrote about history behind regex implementation (last section).

Install LLVM OCaml bindings (and UTop)

Xiayun Sun

2014-05-06 23:27

Comments

I put on a Stackoverflow answer here

Build LLVM with configuration: --with-ocaml-libdir =ocamlc -where``
To compile LLVM codes: ocamlc -cc g++ llvm.cma test.ml -o test
To make UTop work:
- create a myutop_main.ml file with: let () = UTop_main.main ()
- create a custom llvmutop: ocamlfind ocamlmktop -o llvmutop -thread -linkpkg -package utop llvm.cma myutop_main.ml -cc g++
If you just want OCaml's build-in toplevel:
- ocamlmktop -o llvmtop llvm.cma -cc g++

Week 20140504

Xiayun Sun

2014-05-01 22:24

Comments

Dan Grossman

Syntax, semantics, idioms, libraries, tools
records: #field_name record_name
tuples are syntax sugers of records (field_name = 1, 2, 3...)
datatype bindings: Constant of int | Add of expr * expr
pattern matching: case x of =>
one-of (Option); each-of (Records)
type synonym: type name = t. type card = suit * rank
polymorphic datatype: datatype 'a option = NONE | SOME of 'a
val p = e. p --> patterns
fun f p = e
exception, raise
e1 handle exn => e2
tail optimization: remove caller from call stack
the wrath of premature optimization

Foundation by Isaac Asimov

(Notes generated by my little Kindle script

If you're born in a cubicle and grow up in a corridor, and work in a cell, and vacation in a crowded sun-room, then coming up into the open with nothing but sky over you might just give you a nervous breakdown.
I don't care an electron
And when Onum Barr stepped into his little garden early the next morning, he found a box at his feet. It contained provisions, concentrated provisions such as one would find aboard ship, and alien in taste and preparation. But they were good, and lasted long.
The whole war is a battle between those two systems, between the Empire and the Foundation; between the big and the little.

Max sub-array problem

def max_sub_array(l):
    max_ending_here = max_so_far = l[0]
    begin = begin_temp = end = l[0]

    for i in range(1, len(l)):
        if l[i] > l[i] + max_ending_here:
            max_ending_here = l[i]
            begin_temp = i
        else:
            max_ending_here += l[i]

        if max_ending_here >= max_so_far:
            max_so_far = max_ending_here
            begin = begin_temp
            end = i

    return max_so_far, begin, end

App Jamming 2014

Xiayun Sun

2014-04-19 23:00

Comments

App Jamming 2014

I was at my friend Michelle's app jamming 2014 event yesterday as a mentor. This is my first such experience and I thought I'd make a note, mostly because it was fun, partly because I really need to pick up blogging again.

(I'll post pictures once I get them)

App jamming is an annual event where we teach teenage girls (mostly age 11-15) to code with the help of MIT App inventor. This year is the second year and around 50 girls signed up.

Most girls have never written a line of code before. Because of this I was a bit reserved about the idea of writing an app, since in my mind writing an app is not necessary the best introduction to programming. But it turned out alright, and in hindsight, I think making an app has its unique advantages:

Better interaction, with UI

MIT app inventor is awesome in that the build process is very simple and elegant -- just scan a QR code and the app would run on an Android phone.

Kids were over the moon when they made the first app, "HelloPurr", where you can pat a cat and it will "meow". The room was filled with "meow"s and I don't think a traditional "Hello world" would have the same effect. (When I wrote my Hello world I emailed the .exe to my parents only to find out they were disappointed, my mom was like, "I thought there would be music!")
Programming with blocks

The app inventor uses the same idea behind Scratch, also from MIT, where you write programs by dragging logic blocks. While I am still a bit suspicious on whether this will foster the right logic thinking, it is indeed a straightforward-yet-not-scaring introduction of programming, since you don't need a text editor at all.

Along the way I've noticed a few interesting things:

Most kids' favorite app is instagram! I've never ever used instagram. Gesus I feel so old.
Most kids' want to be invisible if they could have one superpower.

I...never...thought about that. I only wished I could have more time than others (there was an episode in Doraemon where Nobi can stop the world's time and go to do whatever he wants and restarts world's time again. I remember watching this and thinking, I really really want to do the same.

Is it because parents nowadays are way too harsh on their kids?
Unfortunately, they also have amazingly short attention spans.

I think it is partly due to the iPad-dominated world right now. I try to help on this as much as possible, by never giving answers straight away but rather guide them through the thinking process ("How do you want to make the ants move? What would happen when the hand crashes with the ant? And what's the right function to do that? Do you think collide is a good choice? Why?") But I don't think it helps much, especially during such short time.

Overall it was fun. Kids were a bit quiet in the very first beginning but jumping around soon after they made the kitten meow. There was a moment during the lunch break, where we accidentally realized that while all the mentors were chatting in the lunch area, all the kids were at their table, focused on their app. That was beautiful and precious.

As Michelle put it, we are working hard at bringing more girls (also with the newly founded Women Who Code HK) into programming because we no longer want to be outliers in this industry. And so far things look promising.

(There was a girl at my table who participated last year, and fell in love with programming. This year she is graduating from high school and she is picking computer science as her major. I was talking to her and telling her she would have a great adventure and we were both excited. It's really fulfilling to see we actually made an impact. )

Week 20140420

Xiayun Sun

2014-04-14 21:41

Comments

Advanced technique for web functional testing

test responsive: self.driver.set_window_size
Needle -- test by comparing screenshots
Saucelabs
the selenium guidebook
selenium testing tools cookbook

Web scraping with Python

multiprocessing.Pool
argparse module

Micellaneous

decorators with arguments
manually create swap file

Recently when I try to update the opam on my DigitalOcean server, the below error pop up:

opam: "fork" failed: Cannot allocate memory.

It is due to lack of RAM memory. The solution is to create a swap file. I followed the instructions here, note to compile core a 512MB swap is not enough, I ended up creating a 1G swap. DigitalOcean also has a nice article on creating swap files

Week 20140406

Xiayun Sun

2014-03-31 13:32

Comments

Dan Grossman

functions over lists are usually recursive: empty list; non-empty list (hd, tl)
let expression: let...in...end
Do a recursive computation, store in (local binding)
Option: isSome, valOf
Bool comparisons: andalso, orelse, not
real is not an equality type (subject to round errors)
no mutation --> no need to worry about alias or copies --> implementation more efficient & better security
Done HW1

Brandon Rhodes Pycon talk

Gustavo Duarte -- what your computer does while you wait
Dan Luu -- how misaligning data can increase performance 12x
Python store general purpose data structures (eg. tuple) as array of addresses
append on a list adds extra spaces -- amortization
list is fast for tail operations: append, pop. Not pop(0), insert(0)
dict is an array with keys stored at integer indexes according to hash value:
- hash -- key address -- value address
- PyCon 2010 "The mighty dictionary"
- can build compound keys with tuple
set: dictionary with no keys
deque, heapq
2013 PyCon, the clean architecture in Python

Miscellaneous

rescue time is a neat productivity tool.

Weekly 20140330

Xiayun Sun

2014-03-24 21:09

Comments

Surely You're joking, Mr. Feynman!

Learn what the rest of the world is like. The variety is worthwhile.

"I could do that, but I won't" -- which is just another way of saying that you can't.

You don't have to be responsible for the world that you're in.

You have no responsibility to live up to what other people think you ought to accomplish.

Dan Grossman

expression: syntax, type-checking, evaluate
REPL: Read, Evaluate, Print, Loop; do not type use without restarting REPL
syntax:
- negation: ~5
- division: div
after expression evaluated, the expression producing the result is irrelevant --> do not reuse use in sml
access pair/tuple: #1 e
naming:
- pair <--> pr

Learn C the hard way

switch statement:
- return 1 --> tell OS there's an error
- only integers allowed (eg. char)
use functions
- forward declaration: use function before defining them
pointers

Miscellaneous

I wrote a small utility to manage my "My Clippings.txt" from Kindle. This script will create a separate note for each book.
gist is an awesome command line tool for gist.
use rlwrap sml to enable history rollback through arrow keys

Weekly 20140323

Xiayun Sun

2014-03-20 19:52

Comments

30 Python language features and tricks

Extended unpacking

>>> a, *b, c = [1,2,3,4]
>>> b
[2,3]

List slices with negative step

>>> a = [1,3,4,5]
>>> a[::-1]
[5,4,3,1]

Name slices (slice(start, end, slice))

>>> LASTTHREE = slice(-3, None, 2)
>>> a = [1,2,3,4,5,6]
>>> a[LASTTHREE]
[4,6]

Dictionary comprehension

>>> m = {x: x * 3 for x in range(5)}
>>> m
{0:0, 1:3, 2:6, 3:9, 4:12, 5:15}

Named tuples

>>> Point = collections.namedtuple('Point', ['x', 'y'])
>>> b = Point(11,13)
>>> b.y
13

Use capital letter for set variables
Counter

>>> from collections import Counter
>>> a = Counter([1,1,2,3])
>>> a
Counter({1:2, 2:1, 3:1})
>>> a.most_common(1)
[(1, 2)]

collections.deque Double-ended queue
default dict

>>> m = collections.defaultdict(int)
>>> m['a']
0
>>> m = collections.defaultdict(str)
>>> m['a']
''

Simple URL encryption

Append a KEY string (with an easy delimitor) to the ID field, then use base64.urlsafe_b64encode and base64.urlsafe_b64decode

flaskr-tdd

TDD:
1. write a test
2. run the test (it should fail)
3. write just enough code to pass the test
4. refactor
Flask test client

tester = app.test_client(self)
response = tester.get('/', content_type="html/text")
# response.data include all html text

Flask flash system: get_flashed_messages --> may be useful for client-side validation
url_for('static', filename = 'style.css')

Install tagbar on vim

create .vim/ if not exist
Install exuberant ctags
Install tagbar

cd ~/.vim
git clone git://github.com/majutsushi/tagbar
mv tagbar/* .  # must be directly under .vim/
sudo rm -r tagbar/
cd doc/
vim tagbar.txt
:helptags .
# quit vim

edit .vimrc (below is mine):

let g:tagbar_ctags_bin='/usr/local/bin/ctags' # change to ctags bin path, this is on MacOS
let g:tagbar_width=40
noremap <silent> tb :TagbarOpen fj <CR>

useful key shortcuts:

tb: open tagbar # my key binding
p: go to tag, stay in tagbar
<enter>: go to tag
o: toggle fold
*: open all fold
=: collapse all fold
?: keymap help

2013

Xiayun Sun

2014-02-24 23:18

Comments

年终奖到账, 交了离职信, 真正就和过去一年说再见了.

2013年是我的本命年, 真的是曲折起伏波澜壮阔. 从年初工作压力大不顺心一直到4月跌到谷底，到拿到了HackerSchool的录取去纽约玩儿了三个月，期间写代码功力见长但也并不是一帆风顺，回到香港后慢慢地把自己的路子从金融扭到工程，而今再过一个月我所有的工资就将来自于Python, 同时我和其他3个小伙伴一起成立了Women Who Code HK, 手里还有几个项目在参与. 这些事情都是4月前做梦也想不到的。

全年目标回顾

读书：10本计算机，12本非计算机(6本虚构6本非虚构，包括多于2本中文书)

完成程度：读了8本计算机，1本虚构(福尔摩斯), 2本非虚构(Mastery, 中国近代史) 计算机书读得多了别的书精力就少了...

编程方面:
- 阅读一个开源项目代码 --> 小项目读了不少，大项目没有。这也是今年打算加强的。
- 1万行代码 --> 完成。数了数大概是1.5万行，Python和OCaml一半一半。今年到目前为止已经写了3千行了....
- 在GitHub上被打星星/有fork --> 完成。目前一共8颗星。不过纽约回来后忙得没有再启动新的个人项目。是时候立项了。
- 活跃在一个邮件列表(mailing list)/社团(community) --> 失败
- 用代码赚钱--> 完成
生活方面:
- 旅游: --> 去了新西兰和纽约
- 存钱: --> 多谢年终奖，即使停薪三个月依然完成目标。
- 换一份工作 --> 完成
- 小提琴500小时，进乐团 --> 失败。花在小提琴上面的时间少了。不过生活的重心本来也转移了。
- 学会游泳 --> 连续第二年失败....
- 跑100次步 --> 跑步33次200公里，爬山2次40公里，徒手锻炼23次。和去年相比大幅下滑。
- 双休日10点前起床 --> 纽约前失败了10次。纽约后全部做到，因为实在是太忙了，有时候想睡懒觉在床上用手机看下邮件就默默爬起来洗澡干活去了....
年度目标总结

1年前写这份目标的时候还一边写一边笑话自己自不量力。回过头来看真的是梦想有多大可能性就有多大。譬如我曾经觉得编程方面我一个都实现不了，结果回过头来看编程方面竟然完成率几乎100%。这要是回到1年前告诉那时的自己都不会相信的吧。

HackerSchool给了我什么

如果写一个HackerSchool之后我做了些什么会是一个牛逼闪闪的单子:

换了工作
同时给某个初创企业写Python后台
成立了Women Who Code HK
参与了一些国际化的开源项目，经常半夜和不同时区的小伙伴们一起开会。

但这些都是表象，我清楚背后HackerSchool真正给我的是这些：

无与伦比的人脉。认识了一群也许是未来十几年甚至几十年美国和一些其他地区程序界的中坚力量。
养成了良好的编程和讨论习惯。保持了好奇心。
学会分享。
不再害怕面对觉得太学术太难的问题。
以及最重要的，学会去做一个无论多么牛逼闪闪都依然善良耐心的正人君子。

这些得着是对无论以后做什么都有益的。

我为什么离开DB

基本就是这几个原因:

市场风险不是我的长期理想职业方向, 如果还在金融业的话我希望自己转到前台做交易员。而新的工作离这个目标更接近。
三年下来觉得我已经像个吸饱了水的海绵学习曲线已经放平了，每天的工作都在自己的comfort zone里面并没有太大挑战。我承认小队长的工作看起来比较有趣，但我是否愿意继续混十几年做到他这个地方呢？我觉得这十几年花在别的地方成就会更大。
在这个团队做得不是很开心，说狠一点就是经常被猪一样的队友噎死。
我最怕的是以后变成那些人到中年没有别的本事只能在后台混日子还感觉良好的大多数人。

递信之后意外地得到了许多老大级别人物的挽留(亚洲区市场风险老大形容我是one of the brightest youngsters I've ever worked with -- 我共事过的最聪明的年轻人之一)，经常被各种MD叫到小屋子里去谈话，这至少也证明了三年里我的工作得到了肯定。这样离开也没有遗憾了。

也收到了一些所谓counter-offer, 但是敝行给出的职位我都不是很满意，我最想去的衍生品交易没有空位，然后给在纽约认识的前瑞银某MD写了个邮件，他回过来说:

It's been pointed out recently that "do what you love" is obnoxious as a general philosophy, because the vast majority of the world doesn't have that option. But if you do have the option, then it's a no-brainer.

(有人指出"要做你热爱的事情"是无稽之谈因为这个世界上的多数人都没有这个选择。然而如果你真的有这个选项，根本不用犹豫。)

然后前两天晚上又和前教授新老板吃了个晚饭，一起聊未来的技术发展聊得很开心。

所以就这样决定了吧。我总是说30岁前我不打算正经赚钱，我在德银也算攒到了第一笔金，是时候趁还输得起的年纪去别的地方闯一闯了。

总结

今年是毕业参加工作的第三年。如果第一年是初入职场的兴奋，第二年是轻车熟路，第三年就是临界点了。这一年是目前为止最辛苦的一年，也是收获最大的一年，我很高兴付出都有回报，而梦想依然在路上。

当然了，要成为"本年龄段本性别里最好的程序员", 要实现给航天飞机和电影写代码的理想，要一项项勾掉"人生100个目标"那张单子，我才刚走过起点呢。

Always remember the future comes one day a time.

永远记住，未来是一天天来到的。

Miguel's flask mega tutorial

Xiayun Sun

2014-02-23 17:01

Comments

Link

Notes

flask:
- Use url_for to build and reference URLs
- app.before_request, app.after_request (after_request can be used to measure database query performance)
- Rotating file handler
- Don't use window.history.back in forms
- Move logic away from views
- Better return query objects than query results
- Pagination (yield_per in sqlalchemy)
- Unit test:
  - use a different test database
  - regression testing: write a test for every new bug
- Debugging: pdb.set_trace()
Jinja:
- Global variables/utilities like config, g, session, url_for are available in Jinja templates
- render_template can be used in rendering HTML email templates
Useful plugins: coverage, cProfile, werkzeug profiler

Code snippets

pass variable in URL:

python @app.route('/index/<int:page>') def turnto(page):
use include in control statements in Jinja:

{% for post in posts %} {% include "post.html" %} {% endfor %}
rollback when error occurred:

python @app.errorhandler(500) def err500(): dbsession.rollback()
Python's fake mail server: python -m smtpd -n -c DebuggingServer localhost:1025
Asynchoronous decorator:

```python from threading import Thread from functools import wraps

def async(f): @wraps(f) def wrapper(args, *kwargs): thr = Thread(target = f, args = args, kwargs = kwargs) thr.start() return wrapper ```
Expose new global variables in Jinja:

python app.jinja_env.globals['momentjs'] = momentjs