I am working on a small script which combines different parts of the Ordnance Survey OpenData into a single file, to produce, for example, a complete outline of the UK from all of the different map tiles in which the OS provides its data. I hope to post some more parts of the script, and maybe a full listing, over the next day or so, as I am using it as an opportunity to learn some new Python tricks.
The first task I came across was filtering out the files that I wanted to merge and storing them in a list. I automatically wrote this statement inside a for loop:
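The original statement didn't survive here; a minimal sketch of that loop, with the base path and the list of folder names as my own stand-ins for a real directory listing, might look like:

```python
base_path = "/data/os-opendata"           # assumed location of the OS tiles
folder_names = [                          # stand-in for a real os.listdir() call
    "SX_AdministrativeBoundary.shp",
    "SX_Road.shp",
    "SV_AdministrativeBoundary.shp",
]

boundary_files = []
for name in folder_names:
    # the in operator performs a simple substring test on each name
    if 'AdministrativeBoundary' in name:
        boundary_files.append(base_path + '/' + name)
```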
This looped through all the folders; anything with 'AdministrativeBoundary' in the name was concatenated to the base path and added to the list. However, I wondered if this was the most efficient way of comparing strings when Python's re module is available, so I swapped the line with the equivalent regex:
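Again the original line is missing; assuming the same stand-in names as above, the regex version would replace the substring test with re.search:

```python
import re

base_path = "/data/os-opendata"           # assumed location of the OS tiles
folder_names = [                          # stand-in for a real directory listing
    "SX_AdministrativeBoundary.shp",
    "SX_Road.shp",
]

boundary_files = []
for name in folder_names:
    # re.search scans the whole string for the pattern, like the in test
    if re.search('AdministrativeBoundary', name):
        boundary_files.append(base_path + '/' + name)
```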
Doubtless there are better ways of doing this whole operation, which I may return to in future posts, but for now I simply wanted to know which of these two string comparisons produced the fastest output. Because this is Python, there is a great little module for this called timeit.

At its simplest level, the timeit module can run a single command, in this case a list comprehension converting the string hello world to upper case, and output the execution time:
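The exact snippet isn't shown, but a plausible reconstruction using the standard timeit module is:

```python
import timeit

# Time a list comprehension upper-casing each character of 'hello world';
# timeit() returns the total time in seconds for its default 1,000,000 runs
t = timeit.timeit("[c.upper() for c in 'hello world']")
print(t)
```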
This time, given in seconds, is the time taken for 1000000 iterations of the command to be run and this number of iterations can be set as follows:
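The number keyword argument controls the iteration count (the figure below is just an illustration):

```python
import timeit

# Run only 10,000 iterations instead of the default 1,000,000
t = timeit.timeit("[c.upper() for c in 'hello world']", number=10000)
print(t)
```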
This is great if you need to compare two single line statements, or are willing to spend the time condensing a block of code into a single line statement, but more often you will want to time the execution of a block of code. This is performed by creating a module to test and calling it from timeit:
This has the same result as the single line statements but now imports the TestModule and executes it a defined number of times.
To try to control for timing errors and other operating system processes getting in the way of an accurate result, it is a good idea to run the timeit() command several times to get a range of results. This could be done with a loop, but there is a convenience function called repeat() which does this for you. It works in the same way as the timeit() command but takes two arguments: repeat, the number of times to loop through the test, and number, as discussed above:
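A sketch of repeat(), using the same one-liner as before (the repetition counts are illustrative):

```python
import timeit

# Five repetitions of 10,000 runs each; repeat() returns one time per repetition
results = timeit.repeat("[c.upper() for c in 'hello world']",
                        repeat=5, number=10000)
print(min(results))
```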
This returns a list of results instead of a single float, upon which all manner of tests could be performed. But before I broke out scipy and matplotlib I read this in the timeit documentation:
Note: It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
So the sensible thing to do is to take the minimum value found in the list and consider that the optimal benchmark execution time for a code block.
Back to the question posed earlier: which is faster, Python's in operator or the equivalent regex? Using what I had just learned about the timeit module, I copied the two chunks of code into a new file, wrapped them up as two modules, called them both through timeit and repeated the tests 20 times to get two minimum execution times:
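The benchmark file itself isn't listed; a sketch along those lines, with hypothetical function names and a toy list of folder names, could be:

```python
import re
import timeit

folder_names = ["SX_AdministrativeBoundary.shp", "SX_Road.shp"]  # toy data

def in_test():
    # substring comparison with the in operator
    return [n for n in folder_names if 'AdministrativeBoundary' in n]

def re_test():
    # equivalent comparison with a regular expression
    return [n for n in folder_names if re.search('AdministrativeBoundary', n)]

# timeit.repeat also accepts a callable directly; take the minimum of 20 runs
in_time = min(timeit.repeat(in_test, repeat=20, number=1000))
re_time = min(timeit.repeat(re_test, repeat=20, number=1000))
print(in_time, re_time)
```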
Upon running the code the results were a bit disappointing: there is very little difference between the two methods, suggesting that I should stick with in and drop the import of the regex library to keep things simple.