Tuesday, 4 December 2012
I came across this page today while installing SciPy, which I am using to plot 3D shapes in matplotlib. It is a collection of Windows binaries for many useful Python modules, and has proven very useful:
Unofficial Windows Binaries for Python Extension Packages
Friday, 2 November 2012
Random Shapefile Data
I often need to quickly generate some random attribute values in a shapefile, either to test code, write an answer on GIS.SE or to mock up some data to test a workflow. There are a few ways to do this simply in pure python using the various random functions, but if all I need to do is populate an attribute table the ArcGIS field calculator will do the job.
The syntax of the command is very un-pythonic, but does allow for a range of different random datasets to be generated. I recently used this command to create some percentage data at work:
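A sketch of the expression, assuming the field calculator's Python parser and its arcgis.rand() helper:

    arcgis.rand("Integer 0 100")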
Which clearly creates random integers between 0 and 100. After reading through the helpfiles I found a list of all the different random distributions that can be used, which I have reproduced below for my own convenience:
- UNIFORM {Minimum}, {Maximum}—A uniform distribution with a range defined by the {Minimum} and {Maximum}. Both the {Minimum} and {Maximum} are of type double. The default values are 0.0 for {Minimum} and 1.0 for {Maximum}.
- INTEGER {Minimum}, {Maximum}—An integer distribution with a range defined by the {Minimum} and {Maximum}. Both the {Minimum} and {Maximum} are of type long. The default values are 1 for {Minimum} and 10 for {Maximum}.
- NORMAL {Mean}, {Standard Deviation}—A Normal distribution with a defined {Mean} and {Standard Deviation}. Both the {Mean} and {Standard Deviation} are of type double. The default values are 0.0 for {Mean} and 1.0 for {Standard Deviation}.
- EXPONENTIAL {Mean}—An Exponential distribution with a defined {Mean}. The {Mean} is of type double. The default value for {Mean} is 1.0.
- POISSON {Mean}—A Poisson distribution with a defined {Mean}. The {Mean} is of type double. The default value for {Mean} is 1.0.
- GAMMA {Alpha}, {Beta}—A Gamma distribution with a defined {Alpha} and {Beta}. Both the {Alpha} and {Beta} are of type double. The default values are 1.0 for {Alpha} and 1.0 for {Beta}.
- BINOMIAL {N}, {Probability}—A Binomial distribution with a defined {N} and {Probability}. The {N} is of type long, and {Probability} is of type double. The default values are 10 for {N} and 0.5 for {Probability}.
- GEOMETRIC {Probability}—A geometric distribution with a defined {Probability}. The {Probability} is of type double. The default value for {Probability} is 0.5.
- NEGATIVE BINOMIAL {N}, {Probability}—A negative binomial distribution with a defined {N} and {Probability}. The {N} is of type long, and {Probability} is of type double. The default values are 10 for {N} and 0.5 for {Probability}.
There are, of course, many other ways to do this but I find that this weird little function does a pretty good job, once you get to grips with the odd syntax.
Wednesday, 24 October 2012
GDAL Contour
I wanted to create some contours from a DEM file today and decided to use GDAL instead of Arc or FME, and I was again pleasantly surprised by the simplicity and power of the GDAL environment. My first port of call was to read the documentation, which showed that this command's syntax does not differ greatly from the rasterize command discussed in the last post.
I decided to generate a set of 10m contours as a shapefile using the following command:
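A sketch of that command, assuming the usual gdal_contour ordering of options before the input and output files:

    gdal_contour -b 1 -a elevation -snodata -9999 -i 10 ns67ne.tif contour.shp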
The switches do the following:
- -b 1 selects the band of the image to process, which defaults to 1
- -a elevation is the name of the contour elevation attribute which will be created
- -snodata -9999 tells GDAL the value of nodata cells in the input raster, so they can be ignored
- ns67ne.tif contour.shp are the input and output files, respectively
- -i 10 is the spacing between each contour
Monday, 22 October 2012
GDAL Rasterize
I wanted to convert a vector outline of the GB coastline into a raster in order to use it as a background layer in MapGuide Maestro. I would usually do such a process in ArcMap, but I am trying to learn open source alternatives to these types of functions. As a result I have been playing around with GDAL and OGR for a week or so now, and have been very impressed with their power and versatility; I only wish I could run them in a UNIX shell at work instead of at the Windows command line. With these two packages, their Python bindings, numpy, pyshp and matplotlib, I think I could begin to move away from ArcGIS without too much pain.
Versions 1.8 and up of GDAL support the creation of output raster files when running the gdal_rasterize command, which makes this operation a single-line process. There are a number of switches and options which can be used, but here I will only go through the switches used for this process. The command I used to convert the vector outline to a raster file was:
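A sketch of that command, assuming the usual gdal_rasterize ordering of options before the source and destination files:

    gdal_rasterize -a ID -l GB_Fix -a_nodata -9999 -a_srs EPSG:3857 -tr 1000 1000 -ot Float32 GB_Fix.shp dest.tif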
The individual switches operate as follows:
- -a ID indicates the attribute to be used as the key when converting the vector file. This can be extended using an SQL statement and the -sql switch to allow only selected parts of a file to be converted.
- -l GB_Fix is the name of the layer to be converted
- -a_nodata -9999 assigns a raster nodata value of -9999
- -a_srs EPSG:3857 sets the coordinate system using an EPSG code
- -tr 1000 1000 sets the raster cell size; the smaller these values are, the bigger the output file will be. I initially ran the command with 10 10 and got a file of over 15 GB
- -ot Float32 sets the datatype of the output file, either Byte, Float32 or Float64 depending on the file format
- GB_Fix.shp dest.tif are the input and output filenames, including extensions
This is just scraping the surface of this one command but serves as a starting point for any future conversions I have to perform using GDAL.
Labels:
arcmap,
conversion,
GDAL,
OGR,
rasterize
Friday, 12 October 2012
Python os.walk
In my processing of OS OpenData into a useful format, I realized that I needed to traverse a lot of directories from a root folder in order to get to the files I wanted, as each map tile is contained in its own folder and these tiles are spread across a number of data CDs. The solution, as with many things in GIS, is to use Python, or more specifically the os.walk function.
For each directory it visits, os.walk returns a triplet of values: a string for the current directory, a list of the directories within it and a list of the files within it. As with all tuples/triplets, these can be iterated through using for loops to access the files and folders at each stage of the walk.
The syntax for this is as follows:
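A minimal sketch (the path here is just a placeholder):

    import os

    for path, dirs, files in os.walk(r"C:\OS_OpenData"):
        print(path)
        print(dirs)
        print(files)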
Which will iterate through the files and folders in the provided path and print the paths, directories and files to the screen. From here it is possible to test for files of a particular extension or filename by replacing the print statements with conditional statements or other operations.
Labels:
python
PyShp Attribute Types and Point files
A quick note on attribute types when using pyshp: these have always seemed really clunky to me, with a lot of trial and error needed to get things working, but I came across the documentation for xBase, the format used to store the attributes in shapefiles, and it is a treasure trove of information.
The following information is of particular relevance for storing geographic data as attributes, which I could never get right before:
- C is ASCII characters
- N is a double precision integer limited to around 18 characters in length
- D is for dates in the YYYYMMDD format, with no spaces or hyphens between the sections
- F is for floating point numbers with the same length limits as N
- L is for logical data which is stored in the shapefile's attribute table as a short integer as a 1 (true) or a 0 (false). The values it can receive are 1, 0, y, n, Y, N, T, F or the python builtins True and False
Each of these datatypes is stored in an attribute created by pyshp using the code:
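For example (a sketch with a hypothetical field name, using the pyshp 1.x API that was current at the time):

    import shapefile

    w = shapefile.Writer(shapefile.POINT)
    w.field("VALUE", "N", 18, 4)  # numeric field: length 18, precision 4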
Where the two numbers are the length and the precision of the numerical value being stored. These are not needed for integers and non-numeric datatypes, but a length of 1 is needed for logical data.
I have written a small code sample which creates data from lists to show the different datatypes being used correctly in pyshp to create points:
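A sketch along those lines, again using the pyshp 1.x API, with made-up field names and values:

    import shapefile

    # Made-up input lists
    names = ["Site A", "Site B", "Site C"]
    values = [10.25, 99.5, 42.0]
    dates = ["20121012", "20121013", "20121014"]  # D fields want YYYYMMDD
    valid = [1, 0, 1]                             # L fields take 1/0, Y/N, T/F or True/False
    xs = [430000.0, 431000.0, 432000.0]
    ys = [110000.0, 111000.0, 112000.0]

    w = shapefile.Writer(shapefile.POINT)
    w.field("NAME", "C", 40)      # character field
    w.field("VALUE", "N", 18, 4)  # number field: length 18, precision 4
    w.field("SURVEYED", "D", 8)   # date field
    w.field("VALID", "L", 1)      # logical field needs a length of 1

    for x, y, name, value, date, flag in zip(xs, ys, names, values, dates, valid):
        w.point(x, y)
        w.record(name, value, date, flag)

    w.save("datatype_points")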
Hopefully I will have time soon to write up examples for polygons and lines and more complex use cases of pyshp.
Thursday, 11 October 2012
Python's Timeit Module
I am working on a small script which combines different parts of the Ordnance Survey OpenData into a single file, to produce, for example, a complete outline of the UK from all of the different map tiles the OS provides its data in. I hope to post some more parts of the script and maybe a full listing over the next day or so, as I am using it as an opportunity to learn some new Python tricks.
The first task I came across was filtering out the files that I wanted to merge and storing them in a list. I automatically wrote this statement inside a for loop:
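Something like this, where the base path and the way the folder names are gathered are placeholders:

    import os

    base_path = r"C:\OS_OpenData"  # placeholder
    file_list = []
    for name in os.listdir(base_path):
        if 'AdministrativeBoundary' in name:
            file_list.append(os.path.join(base_path, name))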
Which looped through all the folders, and anything with 'AdministrativeBoundary' in the name was concatenated to the base path and added to the list. However, I wondered if this was the most efficient way of comparing strings when Python's re module is available. I swapped the line for the equivalent regex:
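Roughly, swapping just the test inside the same loop:

    import re

    if re.search('AdministrativeBoundary', name):
        file_list.append(os.path.join(base_path, name))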
Doubtless there are better ways of doing this whole operation, which I may return to in future posts but for now I simply wanted to know which of these two string comparisons produced the fastest output. Because this is Python, there is a great little module for this called timeit.
At its simplest level, the timeit module can run a single command (in this case a list comprehension converting the string hello world to upper case) and output the execution time:
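A sketch:

    import timeit

    print(timeit.timeit("[c.upper() for c in 'hello world']"))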
This time, given in seconds, is the time taken for 1000000 iterations of the command to be run and this number of iterations can be set as follows:
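For example, to use 100000 iterations instead:

    print(timeit.timeit("[c.upper() for c in 'hello world']", number=100000))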
This is great if you need to compare two single line statements, or are willing to spend the time condensing a block of code into a single line statement, but more often you will want to time the execution of a block of code. This is performed by creating a module to test and calling it from timeit:
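A sketch, assuming the code under test lives in a function inside a file called TestModule.py (the function name is made up):

    # TestModule.py
    def run():
        return [c.upper() for c in 'hello world']

    # timing script, in a separate file
    import timeit

    print(timeit.timeit("TestModule.run()", setup="import TestModule", number=100000))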
This has the same result as the single line statements but now imports the TestModule and executes it a defined number of times.
To try and control for timing errors and other operating system processes getting in the way of an accurate result, it is a good idea to run the timeit() command several times to get a range of results. This could be done with a loop, but there is a convenience function called repeat() which does this for you. It works in the same way as the timeit() command, but takes two arguments: repeat, the number of times to loop through the test, and number, as discussed above:
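For example, repeating the test five times:

    print(timeit.repeat("TestModule.run()", setup="import TestModule", repeat=5, number=100000))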
This returns a list of results instead of a single float, upon which all manner of tests could be performed. But before I broke out scipy and matplotlib I read this in the timeit documentation:
Note: It’s tempting to calculate mean and standard deviation from the result vector and report these. However, this is not very useful. In a typical case, the lowest value gives a lower bound for how fast your machine can run the given code snippet; higher values in the result vector are typically not caused by variability in Python’s speed, but by other processes interfering with your timing accuracy. So the min() of the result is probably the only number you should be interested in. After that, you should look at the entire vector and apply common sense rather than statistics.
So the sensible thing to do is to take the minimum value found in the list and consider that the optimal benchmark execution time for a code block.
Back to the question posed earlier: what is faster, Python's in operator or the equivalent regex? Using what I just learned about the timeit module, I copied the two chunks of code into a new file, wrapped them up as two modules, called them both through timeit and repeated the tests 20 times to get two minimum execution times:
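A sketch of the comparison, with the two versions wrapped as functions over some made-up folder names rather than the original modules:

    import re
    import timeit

    FOLDER_NAMES = ['AdministrativeBoundary_%d' % i for i in range(50)] + \
                   ['Roads_%d' % i for i in range(50)]  # made-up data

    def in_test():
        return [n for n in FOLDER_NAMES if 'AdministrativeBoundary' in n]

    def re_test():
        return [n for n in FOLDER_NAMES if re.search('AdministrativeBoundary', n)]

    in_times = timeit.repeat("in_test()", setup="from __main__ import in_test",
                             repeat=20, number=1000)
    re_times = timeit.repeat("re_test()", setup="from __main__ import re_test",
                             repeat=20, number=1000)

    print("in:    %f" % min(in_times))
    print("regex: %f" % min(re_times))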
Upon running the code the results were a bit disappointing: there is very little difference between the two methods, suggesting that I should stick with in and remove the import of the regex library to keep things simple.
Labels:
performance,
python,
regex,
timeit
Wednesday, 10 October 2012
Coordinate Systems and Web Mapping
I have, as have many other GIS people, long struggled with the range of coordinate systems in use in different areas and locations. I am currently trying to cope with on-the-fly translations between British National Grid and WGS84 Web Mercator, the system of choice for Google Maps et al., for a Web GIS I am developing at work.
I came across this blog post explaining the history of the Web Mercator projection, which served to clear up a lot of my questions and confusion.
For my future reference: use EPSG code 3857 and the following WKT parameters to correct for Google's assumption that the Earth is spherical:
PROJCS["Popular Visualisation CRS / Mercator", GEOGCS["Popular Visualisation CRS", DATUM["WGS84", SPHEROID["WGS84", 6378137.0, 298.257223563, AUTHORITY["EPSG","7059"]], AUTHORITY["EPSG","6055"]], PRIMEM["Greenwich", 0, AUTHORITY["EPSG", "8901"]], UNIT["degree", 0.0174532925199433, AUTHORITY["EPSG", "9102"]], AXIS["E", EAST], AXIS["N", NORTH], AUTHORITY["EPSG","4055"]], PROJECTION["Mercator"], PARAMETER["semi_minor",6378137], PARAMETER["False_Easting", 0], PARAMETER["False_Northing", 0], PARAMETER["Central_Meridian", 0], PARAMETER["Latitude_of_origin", 0], UNIT["metre", 1, AUTHORITY["EPSG", "9001"]], AXIS["East", EAST], AXIS["North", NORTH], AUTHORITY["EPSG","3785"]]
Labels:
bing,
coordinate systems,
google,
WGS84
Monday, 8 October 2012
ArcMap Label Engine Scripting
A question came up on the GIS StackExchange this morning asking about how to perform string indexing in VBScript for ArcMap 10's label engine. It made me realize how verbose VBScript is when compared to Python, and to an extent, JavaScript.
The question required the first 2 characters of a field to be placed on the first line of the label, then the remainder of the field to be on the second line and the third line to contain another field.
To do this in VBScript requires a left function to get the first two characters and a mid function combined with len to get the remainder of the string:
The JavaScript code is a bit cleaner, using \n instead of the horrendous VBNewline, but still uses the substring function combined with a length function to chop up the string:
The Python code uses the \n newline character and simple string indexing to chop up the string: string[:2] returns the first two characters of a string and string[2:] returns everything after the first two characters until the end of the string, with no need to use len(). This is a much more elegant solution, which I also think is a lot easier to read, regardless of language preference:
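A standalone sketch of that indexing with made-up values (inside ArcMap the same logic sits in the label expression's FindLabel function, with the bracketed field names as its arguments):

    def find_label(code, name):
        # first two characters, then the remainder, then the second field
        return code[:2] + "\n" + code[2:] + "\n" + name

    print(find_label("AB1234", "Sample feature"))  # made-up values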
Labels:
arcmap,
javascript,
python,
vbscript
Friday, 28 September 2012
Correcting GPS Coordinates
Today I was given an Excel sheet containing an X and Y column of GPS coordinates that someone in another department had attempted to display in ArcMap. They were successful in that they got some points to display, but these points were located somewhere in the Atlantic, and I was asked to 'fix' it.
In these situations more information is always useful, but none was forthcoming, so I made an educated guess. Often when GPS units capture data in British National Grid, they miss out the first number of each X and Y value as these correspond to the map sheet identifying letter and rarely change for the average user. These letters have numerical equivalents, so I tacked on the correct numbers to each coordinate pair and displayed them in Arc. Lo and behold, they were now in the right place.
I decided to knock together a quick Python script to automate this process, which gave me the opportunity to learn how to interface with excel sheets directly (I would usually export to *.csv or *.dbf) using the xlrd module and refresh my memory about the idiosyncrasies of the wonderful pyshp module. Below is the code I wrote in an hour or so. Originally I was going to write a small interface for it and pass it on to the department, but that is a task for another day. It's a bit ugly, but it does what is needed.
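A rough sketch of the approach, with a made-up spreadsheet layout and made-up prefix digits:

    import xlrd
    import shapefile

    # Assumed layout: first worksheet, X in column 0, Y in column 1, no header row
    book = xlrd.open_workbook("gps_points.xls")
    sheet = book.sheet_by_index(0)

    # Numeric equivalents of the dropped grid letters (made-up values)
    X_PREFIX = "4"
    Y_PREFIX = "1"

    w = shapefile.Writer(shapefile.POINT)
    w.field("ID", "N", 10)

    for row in range(sheet.nrows):
        x = float(X_PREFIX + str(int(sheet.cell_value(row, 0))))
        y = float(Y_PREFIX + str(int(sheet.cell_value(row, 1))))
        w.point(x, y)
        w.record(row)

    w.save("corrected_points")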
Thursday, 27 September 2012
Square Buffers With FME
Using FME has a lot of advantages, however every now and then a function which comes as standard in Quantum or Arc is not available. Today I was trying to clip a raster image of an OS map tile to a buffer of the minimum bounding rectangle of my data, so that I have some context map surrounding my areas of interest.
I knocked together what I assumed would do the trick at the end of my workflow, using the Bounding Box Accumulator to get a bounding box, the Bufferer to buffer this polygon by 500 metres and a Clipper to clip the map tile to the buffered polygon.
This resulted in a rounded map output, which although not incorrect, looked a bit odd to me. Tinkering with the options of the Bufferer did not yield any different buffer shapes, so I began to think creatively. Using square end caps, it is possible to create rectangular buffers around lines, so the objective became to explode the bounding box out into its individual lines.
I achieved this through the TopologyBuilder transformer which allows an area to be input and a line output, and by setting the Maximum Coords Per Line parameter to 2, four individual lines are created. These lines could then be buffered to 500 metres each using square end caps to create 4 overlapping rectangles which covered my desired buffer extent.
(Image caption: blue lines are the individual buffers and the red polygon is the original bounding box.)
The final stage was to aggregate these individual polygons into one larger shape, which can then be used as the clipper to cut out the section of OS map I required. The final workflow is more convoluted than the original plan, but it works without any problems to give me the square buffer I required.
Update: I noticed an error with this workbench where on larger areas the centre of the clipping polygon is not filled in, creating a doughnut. This is easily fixed by connecting the TopologyBuilder's area output into the aggregator, as the original bounding box will always cover the centre of the tile.
Tuesday, 25 September 2012
Python's Null Values
A problem which I and many others have encountered when using python for GIS tasks is the identification of python's different ways of representing an empty value, generally described in other languages as NULL.
For example, in SQL, selecting a series of rows from a table where an attribute is NULL is performed using the command:
SELECT * FROM [TABLE_NAME] WHERE [FIELD] IS NULL
This query will return any row in which FIELD is empty, regardless of why it is empty.
In Python things are a bit more complex: there are three possible types of null which Python can deal with, and knowing which null type you are using is important in avoiding seemingly inexplicable errors cropping up.
The three types are:
- None - the python equivalent of SQL's NULL
- "" - an empty string
- An unassigned variable
These three types can be demonstrated at the python shell:
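For example, assuming x holds None, y holds an empty string and z has never been assigned:

    >>> x = None
    >>> y = ""
    >>> z
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'z' is not defined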
Clearly an unassigned variable is useless as it generates an error, but the x and y values can both be used. Are they equal?

    >>> x == y
    False
They may conceptually be the same thing, but the interpreter considers them to be distinct, which can present problems when trying to set loop conditions, or run models until a null value is filled or encountered. The key is to be clear at the outset which kind of null values you will be encountering.
Labels:
python
Welcome
I have set this blog up as a means for me to keep track of code I write, write small how to guides for my own reference, list useful tools I come across and generally archive interesting things.
If it's of any use to other people then that's a bonus.
I expect to cover topics including:
- Python
- General GIS
- FME
- arcpy
- LiDAR
- Geomorphology
- ArcGIS and ArcPad
- MapGuide
Labels:
misc