1 * '1' = '1'

There are some days I wish Python were statically typed. Today is one of them.


I work on an algorithmic trading desk. Our trading engine is written in Python. For a few days my team has been working on a new feature that assigns a "weight" to each trading strategy: for example, if a strategy has a weight of 2, every order it sends automatically has its size doubled.

Last night the development work was done. I ran the tests, merged back to the development branch, and deployed to our simulation environment.

This morning when I came into the office, I was greeted by an error: basically it was shouting at me that the volume field in an order should be a number, not a string.

This has never happened before.

What happened

My first instinct was to just add a simple int() wrapper around the volume field and get back to my breakfast. Luckily I didn't, and instead sat down to find the bug.

It turns out we were reading the weights from a config file without properly converting them to integers, so each weight stayed a string. Every time a weight was applied to an order volume, Python didn't complain about a type mismatch; instead, something like the following happened:

1 * '1' = '1'

And an order with size '1' (the string, not the number 1) was sent.

Why didn't my tests fail?

We have both unit tests and integration tests (where a dummy strategy runs through the whole market data --> trading signal --> trading execution process).

In this case I hadn't updated the integration tests to include the weight. And in our code the default weight is 1 (NOTE: not '1'), so the tests happily passed.


The fix

Just add an int() coercion when reading the weight from the config file.
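A minimal sketch of the fix with the stdlib configparser (the section and option names here are hypothetical, not our real config):

```python
# Config values are always read as strings -- coerce at the boundary.
# The [strategy]/weight names are made up for this sketch.
import configparser

config = configparser.ConfigParser()
config.read_string("[strategy]\nweight = 2\n")

weight = int(config['strategy']['weight'])  # explicit coercion

volume = 3
order_volume = weight * volume  # 6, an int -- not string repetition
```

A nice side effect: int() raises straight away on a malformed weight, which is much better than silently shipping a string downstream.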

I also included the weight field in my tests.

The real danger

It only occurred to me in the afternoon that if this had not been properly fixed, a real disaster could have happened:

Imagine a strategy with weight = 1 emits a trading signal with size = 3:

3 * '1' = '111'
int('111') = 111  #ahhhh fat finger

Now I don't even want to think about what 5 * '1' would turn out to be.


Lessons learned

  • NEVER blindly coerce types.
  • Always update tests along with new features.
  • Always trace a bug down to its root cause.

Running automated selenium tests on Debian server

In the startup I work at, we use Selenium for automated web testing. Previously the testing flow involved manually opening the Selenium plugin in Firefox, setting the environment-specific variables, and clicking the "Play entire test suite" button. Recently I migrated the task to a Debian server, which frees us from the manual work and enables much more, e.g. test scheduling, integration into the deployment process, etc.

Below are my notes on achieving this.

Running selenium headless

Selenium normally requires a display. On a headless Linux server, I use xvfb to simulate one. The general form is:

xvfb-run --server-num=10 <python commands here>

For example, to discover and run the whole test suite:

xvfb-run --server-num=10 python -m unittest discover --pattern=*.py

I have also found that the method driver.find_element_by_link_text often fails with the message "cannot scroll into view". I tried fixing this with maximize_window and window.scrollTo(x, y), but with no luck. In the end I had to replace some of the clicks with driver.get(url) calls.

POST requests in a logged-in session

Selenium does not support POST requests natively (if it does, please let me know!), which is a bit of a headache when some jQuery POST calls need to be tested within a logged-in session.

Luckily we use cookies to verify logged-in sessions, so the super awesome requests library comes to the rescue.

The tricky part (I am not sure if this is universal) is that when I log in through a POST request to our authentication service, the returned cookie cannot be found in r.cookies; rather, it is part of r.request.headers['Cookie'].

So for example to test the like button in a post (something similar to the facebook like button):

# step1 login
r = requests.post(authen_url, data = {'username': username, 'password': password})

# get cookie from headers
# note r.request.headers['Cookie'] is a semicolon-separated string
# eg: name1=val1; name2=val2
cookie_str = r.request.headers['Cookie']
cookies = {}
for c in cookie_str.split(';'):
    cname, _, cval = c.strip().partition('=')
    cookies[cname] = cval

# step2 test other functions with the obtained cookie
# eg. a "like" POST function here
u = 'domain.com/service/likefeed'
d = {'feedid':feedid}
r = requests.post(u, data = d, cookies = cookies)

# step3 use selenium to test frontend effect, eg. a thumbs up symbol
# selenium webdriver functions here

Supporting different environments

We have 3 sets of environments (demo --> pilot --> production). To switch among them seamlessly, I created a config.ini file holding the environment-specific variables (e.g. base URL, test account login, test post contents, etc.) for each environment, and pass this config (read with the excellent configobj module) when initializing each test suite:

So I have a utility function like below:


from configobj import ConfigObj

def read_env(env):
    c = ConfigObj('config.ini')
    return env, c[env + '_baseurl'], c[env + '_account'], c[env + '_passwd']
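For reference, the config.ini it reads would look something like this (all values are illustrative placeholders, not our real ones):

```ini
demo_baseurl = http://demo.example.com
demo_account = test_account
demo_passwd = test_password

pilot_baseurl = http://pilot.example.com
pilot_account = test_account
pilot_passwd = test_password

prod_baseurl = http://www.example.com
prod_account = test_account
prod_passwd = test_password
```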

Then in each test suite script I add a customized __init__ that initializes properties according to the passed environment; these values can then be used in the test functions.

For example:

# test_suite_1.py

import unittest

from utils import read_env

class TestSuite1(unittest.TestCase):
    def __init__(self, env):
        super(TestSuite1, self).__init__('test_case_1')
        self.env, self.base_url, self.account, self.passwd = read_env(env)

Then I have a master file controlling the complete test run:

# run.py

import sys
import unittest

from test_suite_1 import TestSuite1

if __name__ == '__main__':
    env = sys.argv[1]
    suite = unittest.TestSuite()

    suite.addTest(TestSuite1(env)) # initialize with specific env

    unittest.TextTestRunner().run(suite)


To make things easier I also added a bash alias in .bash_aliases (note: no spaces around the = in an alias):

alias mftest='xvfb-run --server-num=10 python /path/to/tests/run.py'

In this way the complete test process can be kicked off by simply running mftest <env>, where <env> can be demo, pilot, or prod, and the program will pick up the right config.

How to insert images into posts in Nikola

Since it took me a surprisingly long time to solve this simple problem -- it's not in Nikola's official docs (which only have a section on galleries) and my Google searches were unfruitful -- I thought I'd write it down.

The simple trick is based on the fact that Nikola automatically copies all contents under the files/ folder to the output/ folder when you run nikola build.

So here are the steps:

  1. Create a subdirectory in files/, say regex_pic/ (where I put the images for the previous post on parsing regexes) -- this is more about good practice, since I don't like to clutter my output/ folder.

  2. Put your images into that folder.

  3. In your post, use regular markdown syntax to reference those images. For example: ![simple image1](/regex_pic/simple1.jpg) will insert regex_pic/simple1.jpg.

  4. Run nikola build. Done.

Regex parsing: Thompson's algorithm

This is a repost of my previous blog article at the old domain.

Source code can be found here


The central idea behind a regex engine is the non-deterministic finite automaton, or NFA. While the name is scary, it is just a state machine that can transition to other states on certain characters, or on nothing at all (an "epsilon transition").

For example, below is an NFA for the simple concatenation regex ab:

[figure: NFA for the regex ab]

s0 is the start state. It accepts one character 'a' and transitions to state s1. Similarly, s1 accepts one character 'b' and transitions to state s2. s2 is the final state: if we have exhausted the input string and the current state is a final state, the match is successful.

Below is a slightly more complicated example, matching a?a:

[figure: NFA for the regex a?a]

Here the epsilon transition means a state can move to the next state without consuming any input. This combined NFA therefore matches both aa and a, i.e. a?a.

Building a regex engine amounts to building the NFA corresponding to the input pattern. This can be done much like parsing arithmetic expressions: push individual NFAs for individual characters onto a stack; upon seeing an operator (star, question mark, alternation, concatenation), pop NFAs from the stack, connect them to form a new NFA, and push that back onto the stack.

I recommend these two excellent articles on NFA. 1. How Regexes Work 2. Regular Expression Matching Can be Simple and Fast

Thompson's algorithm

The difference between Thompson's algorithm and the backtracking implementations currently in Python/Ruby/... lies in the treatment of multiple possible transitions. A backtracking algorithm tracks only one transition at each step (always the greedy choice) and backtracks to another route if the current one fails, while Thompson's algorithm tracks all possible transitions simultaneously.

Take a?a above for example, and suppose we are matching the input string a against this NFA. The backtracking algorithm tries the a transition from s0 to s1 first, since it is greedy; this route fails because the input is exhausted at the non-final state s1, so the algorithm backtracks to s0 and chooses the epsilon transition instead, arriving at the final state s2 correctly. Thompson's algorithm, on the other hand, tracks both the a and epsilon transitions simultaneously. Before it reads anything, both s0 and s1 are considered starting states, since s0 can transition into s1 without any input. After reading the only character a, only s1 successfully transitions into the final state s2, and the match completes.
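To make the walkthrough concrete, here is a tiny standalone simulation of the a?a NFA in the spirit of the engine's State class (all names here are mine; this is a sketch, not the engine's actual code):

```python
class State:
    def __init__(self, name):
        self.epsilon = []       # states reachable without consuming input
        self.transitions = {}   # char -> next state
        self.name = name
        self.is_end = False

def add_state(state, state_set):
    """Add a state plus everything reachable via epsilon transitions."""
    if state in state_set:
        return
    state_set.add(state)
    for eps in state.epsilon:
        add_state(eps, state_set)

def match(start, s):
    # Thompson simulation: track the whole set of live states at once
    current = set()
    add_state(start, current)
    for c in s:
        nxt = set()
        for st in current:
            if c in st.transitions:
                add_state(st.transitions[c], nxt)
        current = nxt
    return any(st.is_end for st in current)

# build the a?a NFA from the text: s0 -a/eps-> s1 -a-> s2
s0, s1, s2 = State('s0'), State('s1'), State('s2')
s0.transitions['a'] = s1
s0.epsilon.append(s1)      # the '?' epsilon transition
s1.transitions['a'] = s2
s2.is_end = True

print(match(s0, 'a'))    # True
print(match(s0, 'aa'))   # True
print(match(s0, 'aaa'))  # False
```

Note there is no backtracking anywhere: each input character advances the whole set of live states in one pass.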

This means Thompson's algorithm performs significantly better on pathological patterns such as a?a?a?aaa matching aaa: the time complexity is exponential for the backtracking algorithm but only linear for Thompson's.


The first step is to convert the input string into postfix tokens, adding explicit concatenation in the right places. A simple lexer and recursive-descent parser are used for this stage, following the grammar below (for the detailed lexing and parsing code, please refer to the source code):

Grammar for regex:

regex = exp $

exp      = term [|] exp      {push '|'}
         | term
         |                   empty?

term     = factor term       chain {add \x08}  # \x08 is used as CONCAT token
         | factor

factor   = primary [*]       star {push '*'}
         | primary [+]       plus {push '+'}
         | primary [?]       optional {push '?'}
         | primary

primary  = \( exp \)
         | char              literal {push char}

A very simple class is enough to simulate an NFA:

class State:
    def __init__(self, name):
        self.epsilon = [] # epsilon-closure
        self.transitions = {} # a dictionary of char --> state
        self.name = name
        self.is_end = False # is it the ending state?

Next we read one token at a time, perform the relevant NFA transformation, and push the transformed NFA onto a stack. For example, on seeing a CONCAT token, pop two NFAs from the stack, chain them together into a new NFA, and push that back onto the stack. When all the tokens are consumed, there should be exactly one NFA left on the stack. Details on constructing the NFAs can be found in the two articles cited above, or in the book Compilers: Principles, Techniques, and Tools.
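As a sketch of one such transformation, here is a standalone CONCAT handler. A minimal State is restated so the snippet runs on its own, and representing each stack entry as a (start_state, end_state) pair is my assumption, not the engine's exact API:

```python
class State:
    def __init__(self, name):
        self.epsilon = []       # epsilon-closure
        self.transitions = {}   # char -> next state
        self.name = name
        self.is_end = False

def char_nfa(c):
    """Build a one-character NFA, returned as a (start, end) pair."""
    start, end = State('start'), State('end')
    start.transitions[c] = end
    end.is_end = True
    return start, end

def handle_concat(nfa_stack):
    # the second operand was pushed last, so pop it first
    start2, end2 = nfa_stack.pop()
    start1, end1 = nfa_stack.pop()
    # chain: the first NFA's end epsilon-transitions into the second's start
    end1.is_end = False
    end1.epsilon.append(start2)
    nfa_stack.append((start1, end2))

stack = [char_nfa('a'), char_nfa('b')]
handle_concat(stack)  # stack now holds a single NFA matching "ab"
```

The star, plus, and alternation handlers follow the same pop-transform-push shape, differing only in how the popped pieces are wired together with epsilon transitions.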

A handler class is used for this step; it keeps a dictionary mapping each token type to its handler.

class Handler:
    def __init__(self):
        self.handlers = {'CHAR':self.handle_char, 'CONCAT':self.handle_concat,
                         'ALT':self.handle_alt, 'STAR':self.handle_rep,
                         'PLUS':self.handle_rep, 'QMARK':self.handle_qmark}

The steps above form the exposed compile function, which turns an input regex string into an NFA machine.

from parse import Lexer, Parser, Token, State, NFA, Handler

def compile(p, debug = False):

    lexer = Lexer(p)
    parser = Parser(lexer)
    tokens = parser.parse()

    handler = Handler()

    nfa_stack = []

    for t in tokens:
        handler.handlers[t.name](t, nfa_stack)

    assert len(nfa_stack) == 1 # check there is only one final NFA on the stack
    return nfa_stack.pop() 

Next we can feed the constructed NFA a string and check whether it is accepted. The match function uses two sets to hold the current and next states, performs the character and epsilon transitions for each input character, and finally checks whether any ending state has been reached:

def match(self, s):
    current_states = set()
    self.addstate(self.start, current_states) # addstate makes epsilon transitions

    for c in s:
        next_states = set()
        for state in current_states:
            if c in state.transitions:
                trans_state = state.transitions[c]
                self.addstate(trans_state, next_states)

        current_states = next_states

    # check if any ending state is reached; if so, it's a match
    for state in current_states:
        if state.is_end:
            return True
    return False


Benchmarking against Python's built-in re module over normal inputs (from Glenn Fowler's test suite), Python performs better, which is expected since I didn't do any optimization. But the average speed is fast for both -- less than 1ms.

[figure: benchmark on normal inputs]

However, on those pathological inputs, Thompson's algorithm is the clear winner; specifically, Python hangs when n goes over 25. Notice the time scale here is in seconds rather than milliseconds as in the previous plot.

[figure: benchmark on pathological inputs]

Russ Cox wrote about the history behind regex implementations (last section).

Install LLVM OCaml bindings (and UTop)

I put on a Stackoverflow answer here

  1. Build LLVM with configuration: --with-ocaml-libdir=`ocamlc -where`

  2. To compile LLVM code: ocamlc -cc g++ llvm.cma test.ml -o test

  3. To make UTop work:

    • create a myutop_main.ml file with: let () = UTop_main.main ()

    • create a custom llvmutop: ocamlfind ocamlmktop -o llvmutop -thread -linkpkg -package utop llvm.cma myutop_main.ml -cc g++

  4. If you just want OCaml's built-in toplevel:

    • ocamlmktop -o llvmtop llvm.cma -cc g++

Week 20140504

Dan Grossman

  • Syntax, semantics, idioms, libraries, tools
  • records: #field_name record_name
  • tuples are syntactic sugar for records (field names 1, 2, 3...)
  • datatype bindings: Constant of int | Add of expr * expr
  • pattern matching: case x of =>
  • one-of (Option); each-of (Records)
  • type synonym: type name = t. type card = suit * rank
  • polymorphic datatype: datatype 'a option = NONE | SOME of 'a
  • val p = e. p --> patterns
  • fun f p = e
  • exception, raise
  • e1 handle exn => e2
  • tail optimization: remove caller from call stack
  • the wrath of premature optimization

Foundation by Isaac Asimov

(Notes generated by my little Kindle script)

  • If you're born in a cubicle and grow up in a corridor, and work in a cell, and vacation in a crowded sun-room, then coming up into the open with nothing but sky over you might just give you a nervous breakdown.
  • I don't care an electron
  • And when Onum Barr stepped into his little garden early the next morning, he found a box at his feet. It contained provisions, concentrated provisions such as one would find aboard ship, and alien in taste and preparation. But they were good, and lasted long.
  • The whole war is a battle between those two systems, between the Empire and the Foundation; between the big and the little.

Max sub-array problem

def max_sub_array(l):
    # Kadane's algorithm, also tracking the begin/end indices
    max_ending_here = max_so_far = l[0]
    begin = begin_temp = end = 0  # indices, not values

    for i in range(1, len(l)):
        if l[i] > l[i] + max_ending_here:
            # the running sum is negative: start a new subarray at i
            max_ending_here = l[i]
            begin_temp = i
        else:
            max_ending_here += l[i]

        if max_ending_here >= max_so_far:
            max_so_far = max_ending_here
            begin = begin_temp
            end = i

    return max_so_far, begin, end
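A quick sanity check on a classic example (the function is restated so the snippet runs standalone):

```python
def max_sub_array(l):
    # Kadane's algorithm, tracking begin/end indices of the best subarray
    max_ending_here = max_so_far = l[0]
    begin = begin_temp = end = 0

    for i in range(1, len(l)):
        if l[i] > l[i] + max_ending_here:
            # the running sum is negative: start a new subarray at i
            max_ending_here = l[i]
            begin_temp = i
        else:
            max_ending_here += l[i]

        if max_ending_here >= max_so_far:
            max_so_far = max_ending_here
            begin = begin_temp
            end = i

    return max_so_far, begin, end

print(max_sub_array([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # (6, 3, 6): [4, -1, 2, 1]
```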

App Jamming 2014


I was at my friend Michelle's App Jamming 2014 event yesterday as a mentor. This was my first such experience and I thought I'd make a note, mostly because it was fun, partly because I really need to pick up blogging again.

(I'll post pictures once I get them)

App Jamming is an annual event where we teach teenage girls (mostly ages 11-15) to code with the help of MIT App Inventor. This year is the second year, and around 50 girls signed up.

Most of the girls had never written a line of code before. Because of this I was a bit reserved about the idea of writing an app, since in my mind writing an app is not necessarily the best introduction to programming. But it turned out alright, and in hindsight I think making an app has its unique advantages:

  • Better interaction, with UI

    MIT App Inventor is awesome in that the build process is very simple and elegant -- just scan a QR code and the app runs on an Android phone.

    Kids were over the moon when they made their first app, "HelloPurr", where you can pat a cat and it will "meow". The room was filled with "meow"s, and I don't think a traditional "Hello world" would have the same effect. (When I wrote my Hello world I emailed the .exe to my parents, only to find out they were disappointed; my mom was like, "I thought there would be music!")

  • Programming with blocks

    The App Inventor uses the same idea as Scratch, also from MIT: you write programs by dragging logic blocks. While I am still a bit skeptical about whether this fosters the right kind of logical thinking, it is a straightforward yet unintimidating introduction to programming, since you don't need a text editor at all.

Along the way I've noticed a few interesting things:

  • Most kids' favorite app is Instagram! I've never ever used Instagram. Jesus, I feel so old.
  • Most kids would choose invisibility if they could have one superpower.

    I...never...thought about that. I only wished I could have more time than others. (There was an episode in Doraemon where Nobi can stop the world's time, go do whatever he wants, and restart the world's time again. I remember watching this and thinking, I really, really want to do the same.)

    Is it because parents nowadays are way too harsh on their kids?

  • Unfortunately, they also have amazingly short attention spans.

    I think it is partly due to the iPad-dominated world right now. I tried to help with this as much as possible by never giving answers straight away, but rather guiding them through the thinking process ("How do you want to make the ants move? What should happen when the hand collides with an ant? What's the right function to do that? Do you think collide is a good choice? Why?"). But I don't think it helped much, especially in such a short time.

Overall it was fun. The kids were a bit quiet at the very beginning but were jumping around soon after they made the kitten meow. There was a moment during the lunch break when we realized that while all the mentors were chatting in the lunch area, all the kids were at their tables, focused on their apps. That was beautiful and precious.

As Michelle put it, we are working hard at bringing more girls (also with the newly founded Women Who Code HK) into programming because we no longer want to be outliers in this industry. And so far things look promising.

(There was a girl at my table who participated last year and fell in love with programming. This year she is graduating from high school and picking computer science as her major. I was talking to her, telling her she would have a great adventure, and we were both excited. It's really fulfilling to see we actually made an impact.)

Week 20140420

Advanced technique for web functional testing

  • test responsive: self.driver.set_window_size
  • Needle -- test by comparing screenshots
  • Saucelabs
  • the selenium guidebook
  • selenium testing tools cookbook

Web scraping with Python

  • multiprocessing.Pool
  • argparse module


  • decorators with arguments

  • manually create swap file

    Recently when I tried to update opam on my DigitalOcean server, the error below popped up:

    opam: "fork" failed: Cannot allocate memory.

    It is due to a lack of RAM. The solution is to create a swap file. I followed the instructions here; note that to compile Core, a 512MB swap is not enough -- I ended up creating a 1G swap. DigitalOcean also has a nice article on creating swap files.
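The rough shape of those instructions as I applied them (the size is what I used; a real setup would normally live at /swapfile, and swapon plus a persistent fstab entry need root):

```shell
# create a 1G swap file (512MB was not enough to compile Core)
dd if=/dev/zero of=/tmp/swapfile bs=1M count=1024
chmod 600 /tmp/swapfile      # restrict permissions on the swap file
mkswap /tmp/swapfile         # format the file as swap
# then, as root:
#   swapon /tmp/swapfile
#   swapon -s                # verify the swap is active
```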

Week 20140406

Dan Grossman

  • functions over lists are usually recursive: empty list; non-empty list (hd, tl)
  • let expression: let...in...end
  • Do a recursive computation, store in (local binding)
  • Option: isSome, valOf
  • Bool comparisons: andalso, orelse, not
  • real is not an equality type (subject to round errors)
  • no mutation --> no need to worry about alias or copies --> implementation more efficient & better security
  • Done HW1

Brandon Rhodes Pycon talk

  • Gustavo Duarte -- what your computer does while you wait
  • Dan Luu -- how misaligning data can increase performance 12x
  • Python store general purpose data structures (eg. tuple) as array of addresses
  • append on a list adds extra spaces -- amortization
  • list is fast for tail operations: append, pop. Not pop(0), insert(0)
  • dict is an array with keys stored at integer indexes according to hash value:
    • hash -- key address -- value address
    • PyCon 2010 "The mighty dictionary"
    • can build compound keys with tuple
  • set: dictionary with no keys
  • deque, heapq
  • 2013 PyCon, the clean architecture in Python


Weekly 20140330

Surely You're Joking, Mr. Feynman!

Learn what the rest of the world is like. The variety is worthwhile.

"I could do that, but I won't" -- which is just another way of saying that you can't.

You don't have to be responsible for the world that you're in.

You have no responsibility to live up to what other people think you ought to accomplish.

Dan Grossman

  • expression: syntax, type-checking, evaluate
  • REPL: Read, Evaluate, Print, Loop; do not type use without restarting REPL
  • syntax:
    • negation: ~5
    • division: div
  • after expression evaluated, the expression producing the result is irrelevant --> do not reuse use in sml
  • access pair/tuple: #1 e
  • naming:
    • pair <--> pr

Learn C the hard way


  • I wrote a small utility to manage my "My Clippings.txt" from Kindle. The script creates a separate note for each book.

  • gist is an awesome command line tool for GitHub gists.

  • use rlwrap sml to enable history rollback through arrow keys