Thursday, April 7, 2016

Data Normalization, Geocoding and Error Assessment: Sand Mining Suitability Project

Goals and Objectives

The goal of this lab was to become familiar with the process of normalizing a table received from a professional source, in this case the DNR. Unfortunately, most companies and organizations do not format their tables in the way that ArcMap and other programs require. Our job was to learn the proper format and apply it to the table so that the sand mine addresses could later be geocoded. With the data normalized, we could then learn the geocoding process and fix any addresses that came back unmatched or tied. Finally, we would look at the types of errors occurring throughout the entire process and how those errors can be checked and eliminated.

Methods

Geocoding is the process of taking existing addresses and turning them into features on a map. We used two methods: one for addresses given in the standard way (street address, city, state, zip code), and another for addresses given in PLSS (Public Land Survey System), which divides the state into a grid and identifies a location by township, range and section.

Addresses given in the standard way were slightly easier to work with but still required some maintenance. Each address was given in a single column, so we had to separate it into a street address column, a city column, a state column and a zip code column. This is required for geocoding to work and make a match. If a match was made, it was still good practice to check that the matched location was correct, using other sources such as Google Maps, and to fix it if it was not. If the address was unmatched, you would manually pick the correct location on the map, again using other sources. If the match was tied, you would either choose the correct candidate or pick the correct location on the map yourself.
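The column-splitting step can be sketched in Python. This is only an illustration, not the procedure used in the lab (the table was normalized by hand): it assumes each raw address looks like "street, city, ST 12345", and the function name normalize_address and the sample address are hypothetical.

```python
import re

# Assumed layout: "street, city, ST 12345" (optionally ZIP+4).
# Anything that does not fit, such as a PLSS description, returns None
# so it can be set aside for manual placement.
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<street>[^,]+),\s*(?P<city>[^,]+),\s*"
    r"(?P<state>[A-Za-z]{2})\s+(?P<zip>\d{5})(?:-\d{4})?\s*$"
)

def normalize_address(raw):
    """Split a one-column address into street, city, state and zip fields."""
    match = ADDRESS_PATTERN.match(raw)
    if match is None:
        return None
    fields = {name: value.strip() for name, value in match.groupdict().items()}
    fields["state"] = fields["state"].upper()
    return fields

# Hypothetical sample rows, not taken from the DNR table.
print(normalize_address("123 Main St, Eau Claire, WI 54701"))
print(normalize_address("NW 1/4 Sec 12 T25N R9W"))
```

Rows that come back as None would still need to be handled one at a time, as the PLSS addresses were.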

If the address was given in the PLSS format, a little more work and investigating had to be done, because geocoding does not work with PLSS addresses. The DNR geodatabase was used to add the feature classes for townships, ranges and sections. The county feature class was also added to double check that I was in the correct vicinity of the sand mine. Using these feature classes, the area was narrowed down to where the sand mine should be. If the mine was not visible there (because the basemap imagery is old and sand mines are developing very quickly across the state of Wisconsin), Google Maps was used to locate it. Once the sand mine was located, the address point was placed.

After all addresses were finalized, the point feature class was exported as a shapefile to the share folder, where other students geocoding the same sand mines could compare and measure the distance between their own geocoded mines, other students' geocoded mines and the actual addresses of the mines. The way I chose to measure the distance was to add one student's geocoded points at a time, label them by the field "unique mine ID", and then use the measure tool to measure the distance between the two mines that had the same unique mine ID label. The results of these distances are in Table 3.
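The same comparison could be scripted instead of measured by hand. The sketch below is an assumption, not what was done in the lab: it supposes each set of points has already been read into a dictionary keyed by unique mine ID, with coordinates in a projected system measured in meters. The function name and the sample coordinates are hypothetical.

```python
import math

def distances_by_mine_id(my_points, other_points):
    """Return the straight-line distance in meters for every mine ID in both sets.

    Both arguments map a unique mine ID to an (x, y) pair in a projected
    coordinate system whose units are meters (assumed).
    """
    distances = {}
    for mine_id, (x1, y1) in my_points.items():
        if mine_id in other_points:
            x2, y2 = other_points[mine_id]
            distances[mine_id] = math.hypot(x2 - x1, y2 - y1)
    return distances

# Made-up coordinates for illustration only.
mine = {"A": (0.0, 0.0), "B": (10.0, 10.0)}
classmate = {"A": (3.0, 4.0), "C": (1.0, 1.0)}
print(distances_by_mine_id(mine, classmate))  # only mine "A" is in both sets
```

Matching on the ID rather than on position keeps each pair of points correct even when two geocoded locations are far apart.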

Results

Table 1: Sand mine addresses given to us from the DNR before normalization took place.

Notice in Table 1 above that the addresses of the sand mines were simply put into one column, as a street address and/or a PLSS address. These had to be separated out as shown below in Table 2.
Table 2: Sand mine addresses after normalization

Columns have now been added to this table for street address, city, state and zip code. Keeping these fields separate is what allows the geocoder to make a match.
Table 3: Error table showing the distance from my geocoded mines to the same mines geocoded by other students and showing the distance from my geocoded mines to the actual location of those mines

Table 3 shows just how difficult geocoding can be. Even after geocoding each mine by consulting other sources and double checking what I thought was the correct address, I was still off by a certain number of meters. The closest I came to the correct location was 44 meters. This sounds like a large distance, but we were told to geocode each address to the entrance of the sand mine, while the actual addresses were located at the center of the mine. So all addresses will be off by at least 40 meters. The largest distance between my geocoded mine and the actual location was 25,794 meters. This is a tremendous distance and would be unacceptable if the addresses I geocoded were to be used by a professional source. I found that if the distance was 1,000 meters or less, my geocoded mine was actually fairly close. Anything over 1,000 meters would likely be of no use to anyone.
Figure 1: The 16 mines I was assigned to geocode

Figure 1 shows the 16 mine addresses I geocoded. Each mine address was either matched, unmatched or tied. For the unmatched and tied mines, the address was fixed by picking the location and placing it at the entrance of the mine itself. For the matched mines, it was still good practice to check that they were matched to the actual location of the mine. If not, the same procedure was used to pick the address at the entrance of the sand mine.
Figure 2: Proportional symbol map showing the error distance between the mines I geocoded and the actual location of those mines

Figure 2 maps the error distance from each of my geocoded mines to the actual location of that mine. Smaller circles indicate a smaller distance and therefore less error, and larger circles indicate a larger distance and therefore more error. This map illustrates how difficult it is to geocode to the correct location.

Discussion

Error is defined as "the deviation between the measured value and the real world feature" (Lo, Data Quality and Data Standards). There are many different types of errors, such as gross errors, systematic errors and random errors, and some or all of them may have occurred during this process. Gross errors are simply mistakes. These could have happened if I typed an address incorrectly while normalizing the table. Systematic errors can have many causes. In this case, one could have been how I chose to measure the distance between two points. The way I measured is probably different from how many other people went about it, and it may be less precise than other methods. Rounding each distance to a whole number may also have introduced minor errors into the data. These same measurement and rounding errors could also be considered random errors, which are "those discrepancies in the measurements that remain after gross and systematic errors have been eliminated" (Lo, Data Quality and Data Standards).

Errors in geographic data can also be split into two categories: inherent errors and operational errors. Inherent errors are errors that inevitably happen when trying to represent real world objects on a map. Something is always distorted, scales change, and nothing can remain perfect through all of this. Operational errors are errors that occur while collecting, managing and using geographic data (Lo, Data Quality and Data Standards).

The original data had inherent and operational errors. Once it was given to me to normalize and geocode, it picked up even more inherent and operational errors. So the question is how correct these geocoded points really are. Accuracy measurements can be taken to see which points are actually reliable. These may include measures such as Root Mean Square Error and a matrix table. These values let the user know whether the data given to them is reliable.
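Root Mean Square Error is the square root of the mean of the squared error distances. A minimal sketch of the calculation follows. It was not run as part of this lab, the function name rmse is hypothetical, and the sample distances are made up rather than taken from Table 3.

```python
import math

def rmse(error_distances):
    """Root Mean Square Error of a list of error distances (e.g. in meters)."""
    if not error_distances:
        raise ValueError("at least one error distance is required")
    squared_sum = sum(d * d for d in error_distances)
    return math.sqrt(squared_sum / len(error_distances))

# Illustrative distances only.
sample_errors = [50.0, 120.0, 900.0]
print(round(rmse(sample_errors), 1))
```

Because each distance is squared before averaging, a single very large error pulls the RMSE up sharply, so it is worth reading the value alongside the individual distances.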

Conclusion

Geocoding can be very difficult when given a table that has not been properly normalized. Many mistakes and errors can be made along the way, either through your own fault (operational) or because maps can never be a true representation of the real world (inherent). Luckily, there are ways to measure the accuracy of the data and see just how much error it carries, using measures such as Root Mean Square Error and a matrix table.

Sources

  • Lo, Data Quality and Data Standards (reading on D2L)
  • Wisconsin DNR geodatabase (WiDNR2014.gdb)
  • Esri basemaps