GIS

GIS FUNDAMENTALS

Course Description

This course introduces the principal concepts and techniques of geographic information systems (GIS). The course consists of two interrelated parts: a theoretical one that focuses on the concepts and a practical one that aims at developing hands-on skills in using (mostly software) tools. The concepts and techniques introduced in this course will be further enhanced during subsequent courses of the study.

Course Objective

The main objective of the course is to learn how to generate information about the Earth from data stored in geographic information systems. At the end of this core module, participants must be able to:

  • Explain the concept, definition, nature and scope of GIS;
  • Describe the nature of geographic phenomena;
  • Outline the principal data models for spatial and non-spatial data used in GIS databases;
  • Outline the main components of a GIS and their functions;
  • Explain the relationship between spatial data and coordinate systems;
  • Outline the main spatial data analysis functions;
  • Describe aspects of data quality and how they relate to the various stages of spatial data handling;
  • Carry out basic GIS operations:
  • Carry out basic data preparation, geo-referencing and data entry into a GIS;
  • Perform basic manipulation, analysis and visualization operations using a GIS;
  • Apply basic data quality assessment procedures;
  • Apply appropriate GIS methods for problem solving:
  • Understand the capabilities, uses and limitations of GIS in their field of application;
  • Evaluate the results of data processing;
  • Be aware of organizational issues of GIS development and implementation.

Course Content

CHAPTER 1 – Geographic phenomena

1.1 Introduction

A GIS operates under the assumption that the relevant spatial phenomena occur in a two- or three-dimensional Euclidean space, unless otherwise specified. Euclidean space can be informally defined as a model of space in which locations are represented by coordinates—(x; y) in 2D; (x; y; z) in 3D—and distance and direction can be defined with geometric formulas. In the 2D case, this is known as the Euclidean plane, which is the most common Euclidean space in GIS use.
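To make the idea concrete, the short Python sketch below computes the straight-line (Euclidean) distance between two 2D positions from their coordinates; it is an illustration only and is not tied to any particular GIS package.

    import math

    def euclidean_distance(p, q):
        # Straight-line distance between two (x, y) positions in the Euclidean plane.
        return math.hypot(q[0] - p[0], q[1] - p[1])

    print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))  # 5.0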

In order to be able to represent relevant aspects of real world phenomena inside a GIS, we first need to define what it is we are referring to. We might define a geographic phenomenon as a manifestation of an entity or process of interest that:

• Can be named or described,

• Can be georeferenced, and

• Can be assigned a time (interval) at which it is/was present.

The relevant phenomena for a given application depend entirely on one’s objectives.

For instance, in water management, the objects of study might be river basins, agro-ecologic units, measurements of actual evapotranspiration, meteorological data, ground water levels, irrigation levels, water budgets and measurements of total water use. Note that all of these can be named or described, georeferenced and provided with a time  interval at which each exists. In multipurpose cadastral administration, the objects of study are different: houses, land parcels, streets of various types, land use forms, sewage canals and other forms of urban infrastructure may all play a role. Again, these can be named or described, georeferenced and assigned a time interval of existence.

Not all relevant phenomena come as triplets (description; georeference; time interval), though many do. If the georeference is missing, we seem to have something of interest that is not positioned in space: an example is a legal document in a cadastral system. It is obviously somewhere, but its position in space is not considered relevant. If the time interval is missing, we might have a phenomenon of interest that is considered to be always there, i.e. the time interval is (likely to be considered) infinite. If the description is missing, then we have something that exists in space and time, yet cannot be described.

1.2 Types of geographic phenomena

The attempted definition of geographic phenomena above is necessarily abstract, and therefore perhaps somewhat difficult to grasp. The main reason for this is that geographic phenomena come in so many different ‘flavours’, which we will try to categorize below. Before doing so, we must make two further observations.

Firstly, in order to represent a phenomenon in a GIS, we need to state what it is and where it is: we must provide a description—or at least a name—on the one hand, and a georeference on the other. We will skip over the temporal issues for now and come back to these in Section 2.5. The reason for this is that current GISs do not provide much automatic support for time-dependent data, so this topic must be considered an issue of advanced GIS use.

Geographic Field
Secondly, some phenomena manifest themselves essentially everywhere in the study area, while others only do so in certain localities. If we define our study area as the equatorial Pacific Ocean, we can say that Sea Surface Temperature can be measured anywhere in the study area. Therefore, it is a typical example of a (geographic) field.

A (geographic) field is a geographic phenomenon for which, for every point in the study area, a value can be determined.

Some common examples of geographic fields are air temperature, barometric pressure and elevation. These fields are in fact continuous in nature. Examples of discrete fields are land use and soil classifications. For these too, any location in the study area is attributed a single land use class or soil class.

Geographic objects
Many other phenomena do not manifest themselves everywhere in the study area, but only in certain localities. An array of buoys measuring sea surface temperature is a good example: there is a fixed number of buoys, and for each we know exactly where it is located. The buoys are typical examples of (geographic) objects.
(Geographic) objects populate the study area, and are usually well-distinguished, discrete, and bounded entities. The space between them is potentially ‘empty’ or undetermined.
A simple rule-of-thumb is that natural geographic phenomena are usually fields, and man-made phenomena are usually objects. Many exceptions to this rule actually exist, so one must be careful in applying it.

1.3 Geographic fields

A field is a geographic phenomenon that has a value ‘everywhere’ in the study area. We can therefore think of a field as a mathematical function f that associates a specific value with any position in the study area. Hence if (x; y) is a position in the study area, then f(x; y) stands for the value of the field f at locality (x; y).
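As a simple illustration, the sketch below models a continuous field as an ordinary Python function of position; the elevation formula is invented purely for demonstration.

    import math

    def elevation(x, y):
        # A hypothetical field f(x, y): any position in the study area yields a value.
        return 100.0 + 10.0 * math.sin(x / 1000.0) * math.cos(y / 1000.0)

    # Because a field is defined 'everywhere', it can be evaluated at any position:
    print(elevation(2500.0, 1300.0))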

Fields can be discrete or continuous.

Continuous field

Continuous data, or a continuous surface, represents phenomena in which every location on the surface holds a value, for example a measure of concentration or of the relationship to a fixed point in space or to an emitting source. Continuous data is also referred to as field, non-discrete, or surface data.

One type of continuous surface data is derived from those characteristics that define a surface where each location is measured from a fixed registration point. These include elevation (the fixed point being sea level) and aspect (the fixed point being direction: north, east, south, and west).

Continuous surface data

Discrete fields

Discrete data, also known as categorical or discontinuous data, mainly represents objects in both the feature and raster data storage systems. A discrete object has known and definable boundaries. It is easy to define precisely where the object begins and ends. A lake is a discrete object within the surrounding landscape. Where the water’s edge meets the land can be definitively established. Other examples of discrete objects include buildings, roads, and land parcels. Discrete objects are usually nouns.

Discrete data

Field-based model
Essentially, these two types of fields differ in the type of cell values. A discrete field, such as land-use type, stores cell values of type ‘integer’; it is therefore also called an integer raster. Discrete fields can easily be converted to polygons, since it is relatively easy to draw a boundary line around a group of cells with the same value. A continuous raster is also called a ‘floating point’ raster. A field-based model consists of a finite collection of geographic fields: we may be interested in elevation, barometric pressure, mean annual rainfall, and maximum daily evapotranspiration, and thus use four different fields to model the relevant phenomena within our study area.
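The following sketch (Python with the NumPy library) contrasts the two kinds of raster; the class codes and cell values are hypothetical.

    import numpy as np

    # Discrete field: land-use classes stored as an 'integer raster'
    # (hypothetical codes: 1 = forest, 2 = water, 3 = urban).
    landuse = np.array([[1, 1, 2],
                        [1, 3, 2],
                        [3, 3, 2]], dtype=np.int32)

    # Continuous field: elevation stored as a 'floating point' raster.
    elevation = np.array([[101.2, 102.8,  99.7],
                          [103.5, 104.1,  98.9],
                          [105.0, 104.6,  97.3]], dtype=np.float64)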

1.4 Data types and values

Since we have now differentiated between continuous and discrete fields, we may also look at different kinds of data values which we can use to represent our ‘phenomena’. It is important to note that some of these data types limit the types of analyses that we can do on the data itself:
1. Nominal data values are values that provide a name or identifier so that we can discriminate between different values, but that is about all we can do. Specifically, we cannot do true computations with these values. An example is the names of geological units. This kind of data value is called categorical data when the values assigned are sorted according to some set of non-overlapping categories. For example, we might identify the soil type of a given area to belong to a certain (pre-defined) category.
2. Ordinal data values are data values that can be put in some natural sequence but that do not allow any other type of computation. Household income, for instance, could be classified as being either ‘low’, ‘average’ or ‘high’. Clearly this is their natural sequence, but this is all we can say—we cannot say that a high income is twice as high as an average income.
3. Interval data values are quantitative, in that they allow simple forms of computation like addition and subtraction. However, interval data has no arithmetic zero value, and does not support multiplication or division. For instance, a temperature of 20 °C is not twice as warm as 10 °C, and thus centigrade temperatures are interval data values, not ratio data values.
4. Ratio data values allow most, if not all, forms of arithmetic computation. Ratio data have a natural zero value, and multiplication and division of values are possible operators (distances measured in metres are an example). Continuous fields can be expected to have ratio data values, and hence we can interpolate them.
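A small Python sketch may help to show the practical difference between interval and ratio data; the values are arbitrary.

    # Interval data (degrees Celsius): differences are meaningful, ratios are not.
    t_warm, t_cool = 20.0, 10.0
    print(t_warm - t_cool)   # a meaningful 10-degree difference
    # t_warm / t_cool equals 2.0, yet 20 °C is not 'twice as warm' as 10 °C.

    # Ratio data (distance in metres): a true zero exists, so ratios are meaningful.
    d1, d2 = 500.0, 250.0
    print(d1 / d2)           # 2.0 -- d1 really is twice as far as d2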

Qualitative and quantitative data

We usually refer to nominal and categorical data values as ‘qualitative’ data, because we are limited in terms of the computations we can do on this type of data. Interval and ratio data are known as ‘quantitative’ data, as they refer to quantities. Ordinal data, however, does not fit neatly into either category. Often, ordinal data refers to a ranking scheme or some other kind of hierarchical ordering. Road networks, for example, are made up of motorways, main roads, and residential streets. We might expect roads classified as motorways to have more lanes and carry more traffic than a residential street.

1.5 Geographic objects

When a geographic phenomenon is not present everywhere in the study area, but somehow ‘sparsely’ populates it, we look at it as a collection of geographic objects. Such objects are usually easily distinguished and named, and their position in space is determined by a combination of one or more of the following parameters:
• Location (where is it?),
• Shape (what form is it?),
• Size (how big is it?), and
• Orientation (in which direction is it facing?).
How we want to use the information about a geographic object determines which of the four above parameters is required to represent it. For instance, in an in-car navigation system, all that matters about geographic objects like petrol stations is where they are. Thus, location alone is enough to describe them in this particular context, and shape, size and orientation are not necessarily relevant. 
In the same system, however, roads are important objects, and for these some notion of location (where does it begin and end), shape (how many lanes does it have), size (how far can one travel on it) and orientation (in which direction can one travel on it) seem to be relevant information components.
The shape is usually important because one of its factors is dimension: whether an object is perceived as a point feature, or as a linear, area or volume feature.
The petrol stations mentioned above are apparently zero-dimensional, i.e. they are perceived as points in space; roads are one-dimensional, as they are considered to be lines in space. In another use of road information—for instance, in multi-purpose cadastre systems where the precise location of sewers and manhole covers matters—roads might well be considered to be two-dimensional entities, i.e. areas within which a manhole cover may fall.
We usually do not study geographic objects in isolation, but more often we look at collections of objects viewed as a unit. These object collections may also have specific geographic characteristics. Most of the more interesting collections of geographic objects obey certain natural laws. The most common (and obvious) of these is that different objects do not occupy the same location. This, for instance, holds for the collection of petrol stations in an in-car navigation system, the collection of roads in that system, the collection of land parcels in a cadastral system, and in many more cases.
Collections of geographic objects can be interesting phenomena at a higher aggregation level: forest plots form forests, groups of parcels form suburbs, streams, brooks and rivers form a river drainage system, roads form a road network, and SST buoys form an SST sensor network. It is sometimes useful to view geographic phenomena at this more aggregated level and to look at characteristics like coverage, connectedness, and capacity. For example:
• Which part of the road network is within 5 km of a petrol station? (A coverage question)
• What is the shortest route between two cities via the road network? (A connectedness question)
• How many cars can optimally travel from one city to another in an hour? (A capacity question)
Other spatial relationships between the members of a geographic object collection may exist and can be relevant in GIS usage. Many of them fall into the category of topological relationships.

CHAPTER 2 – Geographic information and spatial data types

2.1. GIS

A geographic information system (GIS) is a system designed to capture, store, manipulate, analyze, manage, and present spatial or geographic data. GIS applications are tools that allow users to create interactive queries (user-created searches), analyze spatial information, edit data in maps, and present the results of all these operations.

Today, GIS is a multi-billion-dollar industry employing hundreds of thousands of people worldwide. GIS is taught in schools, colleges, and universities throughout the world. Professionals and domain specialists in every discipline are becoming increasingly aware of the advantages of using GIS technology for addressing their unique spatial problems.

We commonly think of a GIS as a single, well-defined, integrated computer system. However, this is not always the case. A GIS can be made up of a variety of software and hardware tools. The important factor is the level of integration of these tools to provide a smoothly operating, fully functional geographic data processing environment.

Overall, GIS should be viewed as a technology, not simply as a computer system.

In general, a GIS provides facilities for data capture, data management, data manipulation and analysis, and the presentation of results in both graphic and report form, with a particular emphasis upon preserving and utilizing inherent characteristics of spatial data.

The ability to incorporate spatial data, manage it, analyze it, and answer spatial questions is the distinctive characteristic of geographic information systems.

A geographic information system, commonly referred to as a GIS, is an integrated set of hardware and software tools used for the manipulation and management of digital spatial (geographic) and related attribute data.

2.2. COMPONENTS OF A GIS

An operational GIS also has a series of components that combine to make the system work. These components are critical to a successful GIS.

A working GIS integrates five key components:

HARDWARE,

SOFTWARE,

DATA,

PEOPLE,

METHODS

Hardware

Hardware is the computer system on which a GIS operates. Today, GIS software runs on a wide range of hardware types, from centralized computer servers to desktop computers used in stand-alone or networked configurations.

Software

GIS software provides the functions and tools needed to store, analyze, and display geographic information. A review of the key GIS software subsystems is provided above.

Data

Perhaps the most important component of a GIS is the data. Geographic data and related tabular data can be collected in-house, compiled to custom specifications and requirements, or occasionally purchased from a commercial data provider. A GIS can integrate spatial data with other existing data resources, often stored in a corporate DBMS. The integration of spatial data (often proprietary to the GIS software), and tabular data stored in a DBMS is a key functionality afforded by GIS.

People

GIS technology is of limited value without the people who manage the system and develop plans for applying it to real world problems. GIS users range from technical specialists who design and maintain the system to those who use it to help them perform their everyday work. The identification of GIS specialists versus end users is often critical to the proper implementation of GIS technology.

Methods

A successful GIS operates according to a well-designed implementation plan and business rules, which are the models and operating practices unique to each organization.

As in all organizations dealing with sophisticated technology, new tools can only be used effectively if they are properly integrated into the entire business strategy and operation. Doing this properly requires not only the necessary investments in hardware and software, but also the retraining and/or hiring of personnel to utilize the new technology in the proper organizational context. Implementing a GIS without regard for a proper organizational commitment will result in an unsuccessful system! Many of the issues concerned with organizational commitment are described in Implementation Issues and Strategies.

2.3 GIS DATA MODELS

A GIS stores information about the world as a collection of thematic layers that can be linked together by geography. This simple but extremely powerful and versatile concept has proven invaluable for solving many real-world problems from tracking delivery vehicles, to recording details of planning applications, to modeling global atmospheric circulation. The thematic layer approach allows us to organize the complexity of the real world into a simple representation to help facilitate our understanding of natural relationships.


2.4 GIS DATA TYPES

The basic data type in a GIS reflects traditional data found on a map. Accordingly, GIS technology utilizes two basic types of data. These are:

Spatial data

describes the absolute and relative location of geographic features.

Attribute data

describes characteristics of the spatial features. These characteristics can be quantitative and/or qualitative in nature. Attribute data is often referred to as tabular data.

The coordinate location of a forestry stand would be spatial data, while the characteristics of that forestry stand, e.g. cover group, dominant species, crown closure, height, etc., would be attribute data. Other data types, in particular image and multimedia data, are becoming more prevalent with changing technology. Depending on the specific content of the data, image data may be considered either spatial, e.g. photographs, animation, movies, etc., or attribute, e.g. sound, descriptions, narrations, etc.

2.5 SPATIAL DATA MODELS

Traditionally, spatial data has been stored and presented in the form of a map. Three basic types of spatial data models have evolved for storing geographic data digitally. These are referred to as:

Vector;
Raster;
Image.

The following diagram reflects the two primary spatial data encoding techniques: vector and raster. Image data utilizes techniques very similar to raster data, but typically lacks the internal formats required for analysis and modeling of the data. Images reflect pictures or photographs of the landscape.

Representation of the real world, showing the differences in how a vector and a raster GIS represent this real world.

2.6 VECTOR DATA FORMATS

All spatial data models are approaches for storing the spatial location of geographic features in a database. Vector storage implies the use of vectors (directional lines) to represent a geographic feature. Vector data is characterized by the use of sequential points or vertices to define a linear segment. Each vertex consists of an X coordinate and a Y coordinate.

Vector lines are often referred to as arcs and consist of a string of vertices terminated by a node. A node is defined as a vertex that starts or ends an arc segment. Point features are defined by one coordinate pair, a vertex. Polygonal features are defined by a closed set of coordinate pairs, where the first and last vertex are the same. In vector representation, the storage of the vertices for each feature is important, as well as the connectivity between features, e.g. the sharing of common vertices where features connect.
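As a minimal illustration, the three basic vector feature types can be written down as plain coordinate lists; the coordinates below are hypothetical, and real GIS packages store the same information in their own internal structures.

    point = (527.3, 812.9)                            # a single vertex
    line = [(0.0, 0.0), (10.0, 5.0), (20.0, 5.0)]     # a string of vertices (an arc)
    polygon = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0),
               (0.0, 10.0), (0.0, 0.0)]               # closed: first vertex equals last vertex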

Several different vector data models exist, however only two are commonly used in GIS data storage.

The most popular method of retaining spatial relationships among features is to explicitly record adjacency information in what is known as the topologic data model. Topology is a mathematical concept that has its basis in the principles of feature adjacency and connectivity.

The topologic data structure is often referred to as an intelligent data structure because spatial relationships between geographic features are easily derived when using them. Primarily for this reason the topologic model is the dominant vector data structure currently used in GIS technology. Many of the complex data analysis functions cannot effectively be undertaken without a topologic vector data structure. Topology is reviewed in greater detail later on in the book.

The secondary vector data structure that is common among GIS software is the computer-aided drafting (CAD) data structure. This structure consists of listing elements, not features, defined by strings of vertices, to define geographic features, e.g. points, lines, or areas. There is considerable redundancy with this data model since the boundary segment between two polygons can be stored twice, once for each feature. The CAD structure emerged from the development of computer graphics systems without specific considerations of processing geographic features. Accordingly, since features, e.g. polygons, are self-contained and independent, questions about the adjacency of features can be difficult to answer. The CAD vector model lacks the definition of spatial relationships between features that is defined by the topologic data model.

GIS MAP Structure – VECTOR systems (Adapted from Berry)

2.7 RASTER DATA FORMATS

Raster data models incorporate the use of a grid-cell data structure where the geographic area is divided into cells identified by row and column. This data structure is commonly called raster. While the term raster implies a regularly spaced grid, other tessellated data structures do exist in grid-based GIS systems. In particular, the quadtree data structure has found some acceptance as an alternative raster data model.

The size of cells in a tessellated data structure is selected on the basis of the data accuracy and the resolution needed by the user. There is no explicit coding of geographic coordinates required since that is implicit in the layout of the cells. A raster data structure is in fact a matrix where any coordinate can be quickly calculated if the origin point is known, and the size of the grid cells is known. Since grid-cells can be handled as two-dimensional arrays in computer encoding many analytical operations are easy to program. This makes tessellated data structures a popular choice for many GIS software. Topology is not a relevant concept with tessellated structures since adjacency and connectivity are implicit in the location of a particular cell in the data matrix.
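The sketch below shows how a cell’s geographic coordinate can be recovered from the raster origin and cell size; it assumes, for illustration, that the origin is the lower-left corner and that rows are counted upward from it.

    def cell_to_coordinate(row, col, origin_x, origin_y, cell_size):
        # Centre of the cell at (row, col), measured from a lower-left origin.
        x = origin_x + (col + 0.5) * cell_size
        y = origin_y + (row + 0.5) * cell_size
        return x, y

    # Cell in row 2, column 3 of a raster with origin (1000, 5000) and 25 m cells
    # (all values hypothetical):
    print(cell_to_coordinate(2, 3, 1000.0, 5000.0, 25.0))   # (1087.5, 5062.5)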

Several tessellated data structures exist, however only two are commonly used in GIS’s. The most popular cell structure is the regularly spaced matrix or raster structure. This data structure involves a division of spatial data into regularly spaced cells. Each cell is of the same shape and size. Squares are most commonly utilized.

Since geographic data is rarely distinguished by regularly spaced shapes, cells must be classified as to the most common attribute for the cell. The problem of determining the proper resolution for a particular data layer can be a concern. If one selects too coarse a cell size then data may be overly generalized. If one selects too fine a cell size then too many cells may be created resulting in a large data volume, slower processing times, and a more cumbersome data set. As well, one can imply accuracy greater than that of the original data capture process and this may result in some erroneous results during analysis.

As well, since most data is captured in a vector format, e.g. digitizing, data must be converted to the raster data structure. This is called vector-raster conversion. Most GIS software allows the user to define the raster grid (cell) size for vector-raster conversion. It is imperative that the original scale, e.g. accuracy, of the data be known prior to conversion. The accuracy of the data, often referred to as the resolution, should determine the cell size of the output raster map during conversion.

Most raster based GIS software requires that the raster cell contain only a single discrete value. Accordingly, a data layer, e.g. forest inventory stands, may be broken down into a series of raster maps, each representing an attribute type, e.g. a species map, a height map, a density map, etc. These are often referred to as one attribute maps. This is in contrast to most conventional vector data models that maintain data as multiple attribute maps, e.g. forest inventory polygons linked to a database table containing all attributes as columns. This basic distinction of raster data storage provides the foundation for quantitative analysis techniques, often referred to as raster or map algebra. The use of raster data structures allows for sophisticated mathematical modelling processes, while vector based systems are often constrained by the capabilities and language of a relational DBMS.
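A brief sketch of map algebra on two hypothetical one-attribute rasters (Python with NumPy), combining a species layer and a height layer cell by cell:

    import numpy as np

    species_is_pine = np.array([[1, 0],
                                [1, 1]])              # 1 = pine, 0 = other (hypothetical)
    height_m = np.array([[12.0,  8.0],
                         [20.0, 15.0]])               # stand height in metres

    # Map algebra: cell-by-cell combination of layers,
    # e.g. flag pine stands taller than 10 m.
    tall_pine = (species_is_pine == 1) & (height_m > 10.0)
    print(tall_pine)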

GIS MAP Structure – RASTER systems (Adapted from Berry)

This difference is the major distinguishing factor between vector and raster based GIS software. It is also important to understand that the selection of a particular data structure can provide advantages during the analysis stage. For example, the vector data model does not handle continuous data, e.g. elevation, very well, while the raster data model is more ideally suited for this type of analysis. Conversely, the raster structure does not handle linear data analysis, e.g. shortest path, very well, while vector systems do. It is important for the user to understand that there are certain advantages and disadvantages to each data model.

The selection of a particular data model, vector or raster, is dependent on the source and type of data, as well as the intended use of the data. Certain analytical procedures require raster data while others are better suited to vector data.

2.8 IMAGE DATA

Image data is most often used to represent graphic or pictorial data. The term image inherently reflects a graphic representation, and in the GIS world, differs significantly from raster data. Most often, image data is used to store remotely sensed imagery, e.g. satellite scenes or orthophotos, or ancillary graphics such as photographs, scanned plan documents, etc. Image data is typically used in GIS systems as background display data (if the image has been rectified and georeferenced); or as a graphic attribute. Remote sensing software makes use of image data for image classification and processing. Typically, this data must be converted into a raster format (and perhaps vector) to be used analytically with the GIS.

Image data is typically stored in a variety of de facto industry standard proprietary formats. These often reflect the most popular image processing systems. Other graphic image formats, such as TIFF, GIF, PCX, etc., are used to store ancillary image data. Most GIS software will read such formats and allow you to display this data.

Image data is most often used for remotely sensed imagery such as satellite imagery or digital orthophotos.

2.9 VECTOR AND RASTER – ADVANTAGES AND DISADVANTAGES

There are several advantages and disadvantages for using either the vector or raster data model to store spatial data. These are summarized below.

Vector Data

 

Advantages:

  Data can be represented at its original resolution and form without generalization.
  Graphic output is usually more aesthetically pleasing (traditional cartographic representation);
  Since most data, e.g. hard copy maps, is in vector form, no data conversion is required.
  Accurate geographic location of data is maintained.
  Allows for efficient encoding of topology, and as a result more efficient operations that require topological information, e.g. proximity, network analysis.
     
   

Disadvantages:

  The location of each vertex needs to be stored explicitly.
  For effective analysis, vector data must be converted into a topological structure. This is often processing intensive and usually requires extensive data cleaning. As well, topology is static, and any updating or editing of the vector data requires re-building of the topology.
  Algorithms for manipulative and analysis functions are complex and may be processing intensive. Often, this inherently limits the functionality for large data sets, e.g. a large number of features.
  Continuous data, such as elevation data, is not effectively represented in vector form. Usually substantial data generalization or interpolation is required for these data layers.
  Spatial analysis and filtering within polygons is impossible.

Raster Data

 

Advantages:

  The geographic location of each cell is implied by its position in the cell matrix. Accordingly, other than an origin point, e.g. bottom left corner, no geographic coordinates are stored.
  Due to the nature of the data storage technique data analysis is usually easy to program and quick to perform.
  The inherent nature of raster maps, e.g. one attribute maps, is ideally suited for mathematical modeling and quantitative analysis.
  Discrete data, e.g. forestry stands, is accommodated equally well as continuous data, e.g. elevation data, and facilitates the integrating of the two data types.
  Grid-cell systems are very compatible with raster-based output devices, e.g. electrostatic plotters, graphic terminals.
     
   

Disadvantages:

  The cell size determines the resolution at which the data is represented.
  It is especially difficult to adequately represent linear features depending on the cell resolution. Accordingly, network linkages are difficult to establish.
  Processing of associated attribute data may be cumbersome if large amounts of data exist. Raster maps inherently reflect only one attribute or characteristic for an area.
  Since most input data is in vector form, data must undergo vector-to-raster conversion. Besides increased processing requirements this may introduce data integrity concerns due to generalization and choice of inappropriate cell size.
  Most output maps from grid-cell systems do not conform to high-quality cartographic needs.

It is often difficult to compare or rate GIS software that use different data models. Some personal computer (PC) packages utilize vector structures for data input, editing, and display but convert to raster structures for any analysis. Other more comprehensive GIS offerings provide both integrated raster and vector analysis techniques. They allow users to select the data structure appropriate for the analysis requirements. Integrated raster and vector processing capabilities are most desirable and provide the greatest flexibility for data manipulation and analysis.

2.10 ATTRIBUTE DATA MODELS

A separate data model is used to store and maintain attribute data for GIS software. These data models may exist internally within the GIS software, or may be reflected in external commercial database management systems (DBMS). A variety of different data models exist for the storage and management of attribute data. The most common are:

Tabular
Hierarchical
Network
Relational
Object Oriented

The tabular model is the manner in which most early GIS software packages stored their attribute data. The next three models are those most commonly implemented in database management systems (DBMS). The object-oriented model is newer but is rapidly gaining in popularity for some applications. A brief review of each model is provided.

Tabular Model

The simple tabular model stores attribute data as sequential data files with fixed formats (or comma delimited for ASCII data), for the location of attribute values in a predefined record structure. This type of data model is outdated in the GIS arena. It lacks any method of checking data integrity, as well as being inefficient with respect to data storage, e.g. limited indexing capability for attributes or records, etc.

Hierarchical Model

The hierarchical database organizes data in a tree structure. Data is structured downward in a hierarchy of tables. Any level in the hierarchy can have unlimited children, but any child can have only one parent. Hierarchical DBMSs have not gained any noticeable acceptance for use within GIS. They are oriented for data sets that are very stable, where primary relationships among the data change infrequently or never at all. Also, the limitation on the number of parents that an element may have is not always conducive to modelling actual geographic phenomena.

Network Model

The network database organizes data in a network or plex structure. Any column in a plex structure can be linked to any other. Like a tree structure, a plex structure can be described in terms of parents and children. This model allows for children to have more than one parent.

Network DBMSs have not found much more acceptance in GIS than hierarchical DBMSs. They have the same flexibility limitations as hierarchical databases; however, the more powerful structure for representing data relationships allows more realistic modelling of geographic phenomena. However, network databases tend to become overly complex too easily. In this regard it is easy to lose control and understanding of the relationships between elements.

Relational Model

The relational database organizes data in tables. Each table is identified by a unique table name and is organized by rows and columns. Each column within a table also has a unique name. Columns store the values for a specific attribute, e.g. cover group, tree height. Rows represent one record in the table. In a GIS, each row is usually linked to a separate spatial feature, e.g. a forestry stand. Accordingly, each row would be comprised of several columns, each column containing a specific value for that geographic feature. The following table presents sample records for forest inventory features. This table has 4 rows and 5 columns. The forest stand number would be the label for the spatial feature as well as the primary key for the database table. This serves as the linkage between the spatial definition of the feature and the attribute data for the feature.

UNIQUE STAND NUMBER | DOMINANT COVER GROUP | AVG. TREE HEIGHT | STAND SITE INDEX | STAND AGE
001                 | DEC                  | 3                | G                | 100
002                 | DEC-CON              | 4                | M                | 80
003                 | DEC-CON              | 4                | M                | 60
004                 | CON                  | 4                | G                | 120

Data is often stored in several tables. Tables can be joined or referenced to each other by common columns (relational fields). Usually the common column is an identification number for a selected geographic feature, e.g. a forestry stand polygon number. This identification number acts as the primary key for the table. The ability to join tables through use of a common column is the essence of the relational model. Such relational joins are usually ad hoc in nature and form the basis for querying in a relational GIS product. Unlike the other previously discussed database types, relationships are implicit in the character of the data as opposed to explicit characteristics of the database setup.
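The sketch below mimics such an ad hoc relational join using the pandas library; the table contents are hypothetical and simply reuse the stand-number key from the example table above.

    import pandas as pd

    # Attribute table keyed by the unique stand number (the primary key).
    stands = pd.DataFrame({
        "stand_no": ["001", "002", "003", "004"],
        "cover":    ["DEC", "DEC-CON", "DEC-CON", "CON"],
        "age":      [100, 80, 60, 120],
    })

    # A second table sharing the same key, e.g. hypothetical field measurements.
    samples = pd.DataFrame({
        "stand_no":   ["001", "003"],
        "avg_dbh_cm": [31.5, 24.2],
    })

    # The relational join: the tables are linked through their common column.
    joined = stands.merge(samples, on="stand_no", how="left")
    print(joined)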

The relational database model is the most widely accepted for managing the attributes of geographic data.

There are many different designs of DBMSs, but in GIS the relational design has been the most useful. In the relational design, data are stored conceptually as a collection of tables. Common fields in different tables are used to link them together. This surprisingly simple design has been so widely used primarily because of its flexibility and very wide deployment in applications both within and without GIS.

In the relational design, data are stored conceptually as a collection of tables. Common fields in different tables are used to link them together.

In fact, most GIS software provides an internal relational data model, as well as support for commercial off-the-shelf (COTS) relational DBMSs. COTS DBMSs are referred to as external DBMSs. This approach supports both users with small data sets, where an internal data model is sufficient, and customers with larger data sets who utilize a DBMS for other corporate data storage requirements. With an external DBMS the GIS software can simply connect to the database, and the user can make use of the inherent capabilities of the DBMS. External DBMSs tend to have much more extensive querying and data integrity capabilities than the GIS’s internal relational model. The emergence and use of the external DBMS is a trend that has resulted in the proliferation of GIS technology into more traditional data processing environments.

The relational DBMS is attractive because of its:

simplicity in organization and data modelling;
flexibility – data can be manipulated in an ad hoc manner by joining tables;
efficiency of storage – by the proper design of data tables, redundant data can be minimized; and
the non-procedural nature – queries on a relational database do not need to take into account the internal organization of the data.

The relational DBMS has emerged as the dominant commercial data management tool in GIS implementation and application.

The following diagram illustrates the basic linkage between a vector spatial data (topologic model) and attributes maintained in a relational database file.

Basic linkages between a vector spatial data (topologic model) and attributes maintained in a relational database file (From Berry)

Object-Oriented Model

The object-oriented database model manages data through objects. An object is a collection of data elements and operations that together are considered a single entity. The object-oriented database is a relatively new model. This approach has the attraction that querying is very natural, as features can be bundled together with attributes at the database administrator’s discretion. To date, only a few GIS packages are promoting the use of this attribute data model. However, initial impressions indicate that this approach may hold many operational benefits with respect to geographic data processing. Fulfilment of this promise with a commercial GIS product remains to be seen.

CHAPTER 3 – DATA SOURCES

3.1. SOURCES OF DATA

As previously identified, two types of data are input into a GIS, spatial and attribute. The data input process is the operation of encoding both types of data into the GIS database formats.

The creation of a clean digital database is the most important and time consuming task upon which the usefulness of the GIS depends. The establishment and maintenance of a robust spatial database is the cornerstone of a successful GIS implementation.

As well, the digital data is the most expensive part of the GIS. Yet often, not enough attention is given to the quality of the data or the processes by which they are prepared for automation.

The general consensus among the GIS community is that 60 to 80 % of the cost incurred during implementation of GIS technology lies in data acquisition, data compilation and database development.

A wide variety of data sources exist for both spatial and attribute data. The most common general sources for spatial data are:

hard copy maps;
aerial photographs;
remotely-sensed imagery;
point data samples from surveys; and
existing digital data files.

Existing hard copy maps, sometimes referred to as analogue maps, provide the most popular source for any GIS project.

Potential users should be aware that while there are many private sector firms specializing in providing digital data, federal, provincial and state government agencies are an excellent source of data. Because of the large costs associated with data capture and input, government departments are often the only agencies with the financial resources and manpower to invest in data compilation. British Columbia and Alberta government agencies are good examples. Both provincial governments have defined and implemented province-wide coverage of digital base map data at varying map scales, e.g. 1:20,000 and 1:250,000. As well, the provincial forestry agencies also provide thematic forest inventory data in digital format. Federal agencies are also often a good source for base map information. An inherent advantage of digital data from government agencies is its cost: it is typically inexpensive. However, this is often offset by the data’s accuracy and quality. Thematic coverages are often not up to date. However, it is important to note that the specific characteristics of government data vary greatly across North America.

Attribute data has an even wider variety of data sources. Any textual or tabular data that can be referenced to a geographic feature, e.g. a point, line, or area, can be input into a GIS. Attribute data is usually input by manual keying or via a bulk loading utility of the DBMS software. ASCII format is a de facto standard for the transfer and conversion of attribute information.

The following figure describes the basic data types that are used and created by a GIS.

The basic data types that are used and created by a GIS (after Berry).

3.2. DATA INPUT TECHNIQUES

Since the input of attribute data is usually quite simple, the discussion of data input techniques will be limited to spatial data only. There is no single method of entering the spatial data into a GIS. Rather, there are several, mutually compatible methods that can be used singly or in combination.

The choice of data input method is governed largely by the application, the available budget, and the type and the complexity of data being input.

There are at least four basic procedures for inputting spatial data into a GIS. These are:

Manual digitizing;
Automatic scanning;
Entry of coordinates using coordinate geometry; and the
Conversion of existing digital data.

Digitizing

 

While considerable work has been done with newer technologies, the overwhelming majority of GIS spatial data entry is done by manual digitizing. A digitizer is an electronic device consisting of a table upon which the map or drawing is placed. The user traces the spatial features with a hand-held magnetic pen, often called a mouse or cursor. While tracing the features, the coordinates of selected points, e.g. vertices, are sent to the computer and stored. All points that are recorded are registered against positional control points, usually the map corners, that are keyed in by the user at the beginning of the digitizing session. The coordinates are recorded in a user defined coordinate system or map projection. Latitude/longitude and UTM are the coordinate systems most often used. The ability to adjust or transform data during digitizing from one projection to another is a desirable function of the GIS software. Numerous functional techniques exist to aid the operator in the digitizing process.

Digitizing can be done in point mode, where single points are recorded one at a time, or in stream mode, where a point is collected at regular intervals of time or distance, measured by an X and Y movement, e.g. every 3 metres. Digitizing can also be done blindly or with a graphics terminal. Blind digitizing implies that the graphic result is not immediately viewable to the person digitizing. Most systems display the digitized linework as it is being digitized on an accompanying graphics terminal.
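As an illustration of stream-mode capture, the sketch below keeps a digitized vertex only when it lies at least a given distance from the last vertex kept; the tolerance and coordinates are hypothetical, and real digitizing software applies its own filtering rules.

    import math

    def thin_stream(points, tolerance):
        # Keep a vertex only when it is at least 'tolerance' map units
        # away from the last vertex kept (a simple distance-interval filter).
        kept = [points[0]]
        for p in points[1:]:
            if math.hypot(p[0] - kept[-1][0], p[1] - kept[-1][1]) >= tolerance:
                kept.append(p)
        return kept

    raw = [(0.0, 0.0), (0.5, 0.1), (1.2, 0.3), (3.0, 1.0), (3.2, 1.1), (6.0, 2.5)]
    print(thin_stream(raw, 1.0))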

Most GIS’s use a spaghetti mode of digitizing. This allows the user to simply digitize lines by indicating a start point and an end point. Data can be captured in point or stream mode. However, some systems do allow the user to capture the data in an arc/node topological data structure. The arc/node data structure requires that the digitizer identify nodes.

Data capture in an arc/node approach helps to build a topologic data structure immediately. This lessens the amount of post processing required to clean and build the topological definitions. However, most often digitizing with an arc/node approach does not negate the requirement for editing and cleaning of the digitized linework before a complete topological structure can be obtained.

The building of topology is primarily a post-digitizing process that is commonly executed in batch mode after data has been cleaned. To date, only a few commercial vector GIS software offerings have successfully exhibited the capability to build topology interactively while the user digitizes.

Manual digitizing has many advantages. These include:

Low capital cost, e.g. digitizing tables are cheap;
Low cost of labour;
Flexibility and adaptability to different data types and sources;
Easily taught in a short amount of time – an easily mastered skill
Generally the quality of data is high;
Digitizing devices are very reliable and most often offer a greater precision than the data warrants; and
Ability to easily register and update existing data.

For raster based GIS software data is still commonly digitized in a vector format and converted to a raster structure after the building of a clean topological structure. The procedure usually differs minimally from vector based software digitizing, other than some raster systems allow the user to define the resolution size of the grid-cell. Conversion to the raster structure may occur on-the-fly or afterwards as a separate conversion process.

Automatic Scanning

A variety of scanning devices exist for the automatic capture of spatial data. While several different technical approaches exist in scanning technology, all have the advantage of being able to capture spatial features from a map at a rapid rate. However, as yet, scanning has not proven to be a viable alternative for most GIS implementations. Scanners are generally expensive to acquire and operate. As well, most scanning devices have limitations with respect to the capture of selected features, e.g. text and symbol recognition. Experience has shown that most scanned data requires a substantial amount of manual editing to create a clean data layer. Given these basic constraints some other practical limitations of scanners should be identified. These include:

hard copy maps often cannot be taken to where a scanning device is available, e.g. most companies or agencies cannot afford their own scanning device and therefore must send their maps to a private firm for scanning;
hard copy data may not be in a form that is viable for effective scanning, e.g. maps are of poor quality, or are in poor condition;
geographic features may be too few on a single map to make scanning practical or cost-justifiable;
often on busy maps a scanner may be unable to distinguish the features to be captured from the surrounding graphic information, e.g. dense contours with labels;
with raster scanning it is difficult to read unique labels (text) for a geographic feature effectively; and
scanning is much more expensive than manual digitizing, considering all the cost/performance issues.

Consensus within the GIS community indicates that scanners work best when the information on a map is kept very clean, very simple, and uncluttered with graphic symbology.

The sheer cost of scanning usually eliminates the possibility of using scanning methods for data capture in most GIS implementations. Large data capture shops and government agencies are those most likely to be using scanning technology.

Currently, general consensus is that the quality of data captured from scanning devices is not substantial enough to justify the cost of using scanning technology. However, major breakthroughs are being made in the field, with scanning techniques and with capabilities to automatically clean and prepare scanned data for topological encoding. These include a variety of line following and text recognition techniques. Users should be aware that this technology has great potential in the years to come, particularly for larger GIS installations.

Coordinate Geometry

A third technique for the input of spatial data involves the calculation and entry of coordinates using coordinate geometry (COGO) procedures. This involves entering, from survey data, the explicit measurement of features from some known monument. This input technique is obviously very costly and labour intensive. In fact, it is rarely used for natural resource applications in GIS. This method is useful for creating very precise cartographic definitions of property, and accordingly is more appropriate for land records management at the cadastral or municipal scale.

Conversion of Existing Digital Data

A fourth technique that is becoming increasingly popular for data input is the conversion of existing digital data. A variety of spatial data, including digital maps, are openly available from a wide range of government and private sources. The most common digital data to be used in a GIS is data from CAD systems. A number of data conversion programs exist, mostly from GIS software vendors, to transform data from CAD formats to a raster or topological GIS data format. Several ad hoc standards for data exchange have been established in the market place. These are supplemented by a number of government distribution formats that have been developed. Given the wide variety of data formats that exist, most GIS vendors have developed and provide data exchange/conversion software to go from their format to those considered common in the market place.

Most GIS software vendors also provide an ASCII data exchange format specific to their product, and a programming subroutine library that will allow users to write their own data conversion routines to fulfil their own specific needs. As digital data becomes more readily available this capability becomes a necessity for any GIS. Data conversion from existing digital data is not a problem for most technical persons in the GIS field. However, for smaller GIS installations who have limited access to a GIS analyst this can be a major stumbling block in getting a GIS operational. Government agencies are usually a good source for technical information on data conversion requirements.

Some of the data formats common to the GIS marketplace are listed below. Please note that most formats are only utilized for graphic data. Attribute data is usually handled as ASCII text files. Vendor names are supplied where appropriate.

IGDS – Interactive Graphics Design Software (Intergraph / Microstation)

This binary format is a standard in the turnkey CAD market and has become a de facto standard in Canada’s mapping industry. It is a proprietary format, however most GIS software vendors provide DGN translators.

DLG – Digital Line Graph (US Geological Survey)

This ASCII format is used by the USGS as a distribution standard and consequently is well utilized in the United States. It is not used very much in Canada even though most software vendors provide two way conversion to DLG.

DXF – Drawing Exchange Format (Autocad)

This ASCII format is used primarily to convert to/from the Autocad drawing format and is a standard in the engineering discipline. Most GIS software vendors provide a DXF translator.

GENERATE – ARC/INFO Graphic Exchange Format

A generic ASCII format for spatial data used by the ARC/INFO software to accommodate generic spatial data.

EXPORT – ARC/INFO Export Format

An exchange format that includes both graphic and attribute data. This format is intended for transferring ARC/INFO data from one hardware platform, or site, to another. It is also often used for archiving ARC/INFO data. This is not a published data format; however, some GIS and desktop mapping vendors provide translators. EXPORT format can come in either uncompressed, partially compressed, or fully compressed format.

A wide variety of other vendor specific data formats exist within the mapping and GIS industry. In particular, most GIS software vendors have their own proprietary formats. However, almost all provide data conversion to/from the above formats. As well, most GIS software vendors will develop data conversion programs dependent on specific requests by customers. Potential purchasers of commercial GIS packages should determine and clearly identify their data conversion needs to the software vendor prior to purchase.

3.3. DATA EDITING AND QUALITY ASSURANCE

Data editing and verification is a response to the errors that arise during the encoding of spatial and non-spatial data. The editing of spatial data is a time consuming, interactive process that can take as long as, if not longer than, the data input process itself.

Several kinds of errors can occur during data input. They can be classified as:

Incompleteness of the spatial data. This includes missing points, line segments, and/or polygons.
Locational placement errors of spatial data. These types of errors usually are the result of careless digitizing or poor quality of the original data source.
Distortion of the spatial data. This kind of error is usually caused by base maps that are not scale-correct over the whole image, e.g. aerial photographs, or from material stretch, e.g. paper documents.
Incorrect linkages between spatial and attribute data. This type of error is commonly the result of incorrect unique identifiers (labels) being assigned during manual key in or digitizing. This may involve the assigning of an entirely wrong label to a feature, or more than one label being assigned to a feature.
Attribute data is wrong or incomplete. Often the attribute data does not match exactly with the spatial data. This is because they are frequently from independent sources and often different time periods. Missing data records or too many data records are the most common problems.

The identification of errors in spatial and attribute data is often difficult. Most spatial errors become evident during the topological building process. The use of check plots to clearly determine where spatial errors exist is a common practice. Most topological building functions in GIS software clearly identify the geographic location of the error and indicate the nature of the problem. Comprehensive GIS software allows users to graphically walk through and edit the spatial errors. Others merely identify the type and coordinates of the error. Since this is often a labour intensive and time consuming process, users should consider the error correction capabilities very important during the evaluation of GIS software offerings.
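One simple, GIS-independent check for the linkage errors described above is to compare the unique identifiers attached to the spatial features with those present in the attribute table. The sketch below uses hypothetical labels.

    # Labels digitized with the spatial features versus labels in the attribute table.
    spatial_labels = {"001", "002", "003", "004"}
    attribute_labels = ["001", "002", "002", "005"]      # a duplicate and a stray record

    missing_attributes = spatial_labels - set(attribute_labels)   # features with no record
    orphan_records = set(attribute_labels) - spatial_labels       # records with no feature
    duplicates = {x for x in attribute_labels if attribute_labels.count(x) > 1}
    print(missing_attributes, orphan_records, duplicates)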

Spatial Data Errors

A variety of common data problems occur in converting data into a topological structure. These stem from the original quality of the source data and the characteristics of the data capture process. Usually data is input by digitizing. Digitizing allows a user to trace spatial data from a hard copy product, e.g. a map, and have it recorded by the computer software. Most GIS software has utilities to clean the data and build a topologic structure. If the data is unclean to start with, for whatever reason, the cleaning process can be very lengthy. Interactive editing of data is a distinct reality in the data input process.

Experience indicates that in the course of any GIS project 60 to 80 % of the time required to complete the project is involved in the input, cleaning, linking, and verification of the data.

The most common problems that occur in converting data into a topological structure include:

slivers and gaps in the line work;
dead ends, also called dangling arcs, resulting from overshoots and undershoots in the line work; and
bow ties or weird polygons from inappropriate closing of connecting features.

Of course, topological errors only exist with linear and areal features. They become most evident with polygonal features. Slivers are the most common problem when cleaning data. Slivers frequently occur when coincident boundaries are digitized separately, e.g. once each for adjacent forest stands, once for a lake and once for the stand boundary, or after polygon overlay. Slivers often appear when combining data from different sources, e.g. forest inventory, soils, and hydrography. It is advisable to digitize data layers with respect to an existing data layer, e.g. hydrography, rather than attempting to match data layers later. A proper plan and definition of priorities for inputting data layers will save many hours of interactive editing and cleaning.

Dead ends usually occur when data has been digitized in a spaghetti mode, or without snapping to existing nodes. Most GIS software will clean up undershoots and overshoots based on a user-defined tolerance, e.g. distance. The definition of an inappropriate distance often leads to the formation of bow ties or weird polygons during topological building. Tolerances that are too large will force arcs to snap to one another when they should not be connected. The result is small polygons called bow ties. The definition of a proper tolerance for cleaning requires an understanding of the scale and accuracy of the data set.
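
As a minimal illustration of tolerance-based snapping (not any vendor's actual cleaning routine), the sketch below moves dangling arc ends onto nearby nodes when they fall within a user-defined distance; the coordinates and helper names are invented for illustration.

    import math

    def snap_endpoints(arcs, nodes, tolerance):
        # arcs:  list of coordinate lists, e.g. [[(5.0, 5.0), (9.7, 0.0)]]
        # nodes: fixed node coordinates the arcs are expected to connect to
        for arc in arcs:
            for i in (0, -1):                          # examine both arc ends
                nearest = min(nodes, key=lambda n: math.dist(arc[i], n))
                d = math.dist(arc[i], nearest)
                if 0 < d <= tolerance:
                    arc[i] = nearest                   # snap under/overshoot onto the node
        return arcs

    nodes = [(0.0, 0.0), (10.0, 0.0)]                  # existing network nodes
    arcs = [[(5.0, 5.0), (9.7, 0.0)]]                  # an undershoot of 0.3 units
    print(snap_endpoints(arcs, nodes, tolerance=0.5))
    # -> [[(5.0, 5.0), (10.0, 0.0)]]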

The other problem that commonly occurs when building a topologic data structure is duplicate lines. These usually occur when data has been digitized or converted from a CAD system. The lack of topology in these types of drafting systems permits the inadvertent creation of elements that are exact duplicates. However, most GIS packages afford automatic elimination of duplicate elements during the topological building process. Accordingly, it may not be a concern with vector based GIS software. Users should be aware of the duplicate element that retraces itself, e.g. a three-vertex line where the first point is also the last point. Some GIS packages do not identify these feature inconsistencies and will build such a feature as a valid polygon. This is because the topological definition is mathematically correct; however, it is not geographically correct. Most GIS software will provide the capability to eliminate bow ties and slivers by means of a feature elimination command based on area, e.g. polygons less than 100 square metres. The ability to define custom topological error scenarios and provide for semi-automated correction is a desirable capability for GIS software.

The adjoining figure illustrates some typical errors described above. Can you spot them? They include undershoots, overshoots, bow ties, and slivers. Most bow ties occur when inappropriate tolerances are used during the automated cleaning of data that contains many overshoots. This particular set of spatial data is a prime candidate for numerous bow tie polygons.

Attribute Data Errors

The identification of attribute data errors is usually not as simple as spatial errors. This is especially true if these errors are attributed to the quality or reliability of the data. Such errors usually do not surface until later on in the GIS processing. Solutions to these types of problems are much more complex, and often no complete solution exists. It is much more difficult to spot errors in attribute data when the values are syntactically good, but incorrect.

Simple errors of linkage, e.g. missing or duplicate records, become evident during the linking operation between spatial and attribute data. Again, most GIS software contains functions that check for and clearly identify problems of linkage during attempted operations. This is also an area of consideration when evaluating GIS software.
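
A simple linkage check can be sketched in a few lines; the sketch below is illustrative only, assuming the spatial features and the attribute table each carry a list of unique identifiers (the stand numbers are invented).

    from collections import Counter

    def check_linkage(feature_labels, attribute_labels):
        # feature_labels:   unique identifiers carried by the spatial features
        # attribute_labels: unique identifiers found in the attribute table
        feat, attr = Counter(feature_labels), Counter(attribute_labels)
        return {
            "features_without_records": sorted(set(feat) - set(attr)),
            "records_without_features": sorted(set(attr) - set(feat)),
            "duplicate_feature_labels": sorted(k for k, n in feat.items() if n > 1),
            "duplicate_records":        sorted(k for k, n in attr.items() if n > 1),
        }

    # Example: stand 105 has two attribute records, stand 107 has none.
    print(check_linkage([101, 102, 103, 107], [101, 102, 103, 105, 105]))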

Data Verification

Six clear steps stand out in the data editing and verification process for spatial data. These are:

Visual review. This is usually by check plotting.

Cleanup of lines and junctions. This process is usually done by software first and interactive editing second.
Weeding of excess coordinates. This process involves the removal of redundant vertices by the software for linear and/or polygonal features.
Correction for distortion and warping. Most GIS software has functions for scale correction and rubber sheeting. However, the specific rubber-sheeting algorithm used will vary depending on the spatial data model, vector or raster, employed by the GIS. Some raster techniques may be more computationally intensive than vector-based algorithms.
Construction of polygons. Since the majority of data used in GIS is polygonal, the construction of polygon features from lines/arcs is necessary. Usually this is done in conjunction with the topological building process.
The addition of unique identifiers or labels. Often this process is manual. However, some systems do provide the capability to automatically build labels for a data layer.

These data verification steps occur after the data input stage and prior to or during the linkage of the spatial data to the attributes. Data verification ensures the integrity between the spatial and attribute data. Verification should include some brief querying of attributes and cross checking against known values.

CHAPTER 4 – DATA ORGANIZATION AND STORAGE

4.1 INTRODUCTION

This chapter reviews the approaches for organizing and maintaining data in a GIS. The focus is on reviewing different techniques for storing spatial data. A brief review of data querying approaches for attribute data is also provided.

Organizing Data for Analysis
Editing and Updating of Data

The second necessary component for a GIS is the data storage and retrieval subsystem. This subsystem organizes the data, both spatial and attribute, in a form which permits it to be quickly retrieved for updating, querying, and analysis. Most GIS software utilizes proprietary software for their spatial editing and retrieval system, and a database management system (DBMS) for their attribute storage. Typically, an internal data model is used to store primary attribute data associated with the topological definition of the spatial data. Most often these internal database tables contain primary columns such as area, perimeter, length, and internal feature id number. Often thematic attribute data is maintained in an external DBMS that is linked to the spatial data via the internal database table.
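
As a rough illustration of this internal/external linkage (the table contents and column names are invented), the internal feature id can serve as the join key between the two sides:

    # Internal table maintained with the topological structure (illustrative values).
    internal = {
        1: {"area": 12.4, "perimeter": 15.1, "length": 0.0},
        2: {"area":  3.2, "perimeter":  7.9, "length": 0.0},
    }

    # Thematic attributes held in an external DBMS table, keyed by the same id.
    external = {
        1: {"cover_group": "spruce", "site_index": "good"},
        2: {"cover_group": "aspen",  "site_index": "poor"},
    }

    # Linking the two on the internal feature id yields the full attribute record.
    linked = {fid: {**internal[fid], **external.get(fid, {})} for fid in internal}
    print(linked[1])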

4.2 ORGANIZING DATA FOR ANALYSIS

Most GIS software organizes spatial data in a thematic approach that categorizes data in vertical layers. The definition of layers is fully dependent on the organization’s requirements. Typical layers used in natural resource management agencies or companies include forest cover, soil classification, elevation, road network (access), ecological areas, hydrology, etc.

Spatial data layers are commonly input one at a time, e.g. forest cover. Accordingly, attribute data is entered one layer at a time. Depending on the attribute data model used by the data storage subsystem, data must be organized in a format that will facilitate the manipulation and analysis tasks required. Most often, the spatial and attribute data may be entered at different times and linked together later. However, this is fully dependent on the source of data.

The clear identification of the requirements for any GIS project is necessary before any data input procedures, and/or layer definitions, should occur.

It is mandatory that GIS users fully understand their needs before undertaking a GIS project.

Experience has shown that a less than complete understanding of the needs and processing tasks required for a specific project greatly increases the time required to complete the project, and ultimately affects the quality and reliability of the derived GIS product(s).

4.3 SPATIAL DATA LAYERS – VERTICAL DATA ORGANIZATION

In most GIS software data is organized in themes as data layers. This approach allows data to be input as separate themes and overlaid based on analysis requirements. This can be conceptualized as vertically layering the characteristics of the earth’s surface. The overlay concept is so natural to cartographers and natural resource specialists that it has been built into the design of most CAD vector systems as well. The overlay/layer approach used in CAD systems is used to separate major classes of spatial features. This concept is also used to logically order data in most GIS software. The terminology may differ between GIS software, but the approach is the same. A variety of terms are used to define data layers in commercial GIS software. These include themes, coverages, layers, levels, objects, and feature classes. Data layer and theme are the most common and the least proprietary to any particular GIS software, and accordingly are used throughout this book.

In any GIS project a variety of data layers will be required. These must be identified before the project is started and a priority given to the input or digitizing of the spatial data layers. This is mandatory, as often one data layer contains features that are coincident with another, e.g. lakes can be used to define polygons within the forest inventory data layer. Data layers are commonly defined based on the needs of the user and the availability of data. They are completely user definable.

The definition of data layers is fully dependent on the area of interest and the priority needs of the GIS. Layer definitions can vary greatly depending on the intended needs of the GIS.

When considering the physical requirements of the GIS software it is important to understand that two types of data are required for each layer, attribute and spatial data. Commonly, data layers are input into the GIS one layer at a time. As well, often a data layer is completely loaded, e.g. graphic conversion, editing, topological building, attribute conversion, linking, and verification, before the next data layer is started. Because there are several steps involved in completely loading a data layer it can become very confusing if many layers are loaded at once.

The proper identification of layers prior to starting data input is critical. The identification of data layers is often achieved through a user needs analysis. The user needs analysis performs several functions including:

identifying the users;

educating users with respect to GIS needs;

identifying information products;

identifying data requirements for information products;

prioritizing data requirements and products; and

determining GIS functional requirements.

Often a user needs assessment will include a review of existing operations, e.g. sometimes called a situational assessment, and a cost-benefit analysis. The cost-benefit process is well established in conventional data processing and serves as the mechanism to justify acquisition of hardware and software. It defines and compares costs against potential benefits. Most institutions will require this step before a GIS acquisition can be undertaken.

Most GIS projects integrate data layers to create derived themes or layers that represent the result of some calculation or geographic model, e.g. forest merchantability, land use suitability, etc. Derived data layers are completely dependent on the aim of the project.

Each data layer would be input individually and topologically integrated to create combined data layers. Based on the data model, e.g. vector or raster, and the topological structure, selected data analysis functions could be undertaken. It is important to note that in vector based GIS software the topological structure defined can only be traversed by means of unique labels assigned to every feature.

Spatial Indexing – Horizontal Data Organization

The proprietary organization of data layers in a horizontal fashion within a GIS is known as spatial indexing. Spatial indexing is the method utilized by the software to store and retrieve spatial data. A variety of different strategies exist for speeding up the spatial feature retrieval process within a GIS software product. Most involve the partitioning of the geographic area into manageable subsets or tiles. These tiles are then indexed mathematically, e.g. by quadtrees or R (rectangle) trees, to allow for quick searching and retrieval when querying is initiated by a user. Spatial indexing is analogous to the definition of map sheets, except that specific indexing techniques are used to access data across map sheet (tile) boundaries. This is done simply to improve query performance for large data sets that span multiple map sheets, and to ensure data integrity across map sheet boundaries.

The method and process of spatial indexing is usually transparent to the user. However, it becomes very important especially when large data sets are utilized. The notion of spatial indexing has become increasingly important in the design of GIS software over the last few years, as larger scale applications have been initiated using GIS technology. Users have found that often the response time in querying very large data sets is unacceptably slow. GIS software vendors have responded by developing sophisticated algorithms to index and retrieve spatial data. It is important to note that raster systems, by the nature of their data structure, do not typically require a spatial indexing method. The raster approach imposes regular, readily addressable partitions on the data universe intrinsically with its data structure. Accordingly, spatial indexing is usually not required. However, the more sophisticated vector GIS does require a method to quickly retrieve spatial objects.
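
A much-simplified sketch of the tiling idea (not any vendor's actual index structure) partitions point features into fixed-size tiles so that a window query only touches the tiles it overlaps; the feature ids and coordinates are invented.

    def tile_key(x, y, tile_size):
        # Map a coordinate to the tile (column, row) that contains it.
        return (int(x // tile_size), int(y // tile_size))

    def build_index(features, tile_size):
        # features: list of (feature_id, x, y) point records.
        index = {}
        for fid, x, y in features:
            index.setdefault(tile_key(x, y, tile_size), []).append(fid)
        return index

    def query_window(index, xmin, ymin, xmax, ymax, tile_size):
        # Only tiles overlapping the query window are visited.
        cmin, rmin = tile_key(xmin, ymin, tile_size)
        cmax, rmax = tile_key(xmax, ymax, tile_size)
        hits = []
        for c in range(cmin, cmax + 1):
            for r in range(rmin, rmax + 1):
                hits.extend(index.get((c, r), []))
        return hits

    idx = build_index([(1, 120, 340), (2, 980, 15), (3, 135, 355)], tile_size=100)
    print(query_window(idx, 100, 300, 200, 400, tile_size=100))   # -> [1, 3]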

The above diagram illustrates a typical map library that is compiled for an area of interest. The ‘forest cover’ layer is shown for 6 sample tiles to illustrate how data is transparently stored in a map library using a spatial index.

The horizontal indexing of spatial data within GIS software involves several issues. These concern the extent of the spatial indexing approach. They include:

the use of a librarian subsystem to organize data for users;
the requirement for a formal definition of layers;
the need for feature coding within themes or layers; and
requirements to maintain data integrity through transaction control, e.g. the locking of selected spatial tiles (or features) when editing is being undertaken by a permitted user.

While all these issues need not be satisfied for spatial indexing to occur, they are important aspects users should consider when evaluating GIS software.

While the spatial indexing method is usually not the selling point of any GIS, users should consider these requirements, especially if very large data sets, e.g. 10,000 + polygons, are to be the norm in their applications, and a vector data model is to be employed.

4.4 EDITING AND UPDATING OF DATA

Perhaps the primary function in the data storage and retrieval subsystem involves the editing and updating of data. Frequently, the following data editing capabilities are required:

interactive editing of spatial data;
interactive editing of attribute data;
the ability to add, manipulate, modify, and delete both spatial features and attributes (independently or simultaneously); and
the ability to edit selected features in a batch processing mode.

Updating involves more than the simple editing of features. Updating implies the resurvey and processing of new information. The updating function is of great importance during any GIS project. The life span of most digital data can range anywhere from 1 to 10 years. Commonly, digital data is valid for 5 to 10 years. The lengthy time span is due to the intensive task of data capture and input. However, often periodic data updates are required. These frequently involve an increased accuracy and/or detail of the data layer. Changes in classification standards and procedures may necessitate such updates. Updates to a forest cover data layer to reflect changes from a forest fire burn or a harvest cut are typical examples.

Many times data updates are required based on the results of a derived GIS product. The generation of a derived product may identify blatant errors or inappropriate classes for a particular layer. When this occurs updating of the data is required. In this situation the GIS operator usually has some previous experience or knowledge of the study area.

Commonly the data update process is a result of a physical change in the geographic landscape. Forest fires are a prime example. With this type of update new features are usually required for the data layer, e.g. burn polygons. As well, existing features are altered, e.g. forest stands that were affected. There is a strong requirement for a historical record keeping capability with this type of update process. Users should be aware of this requirement and design their database organization to accommodate such needs. Depending on the particular GIS, the update process may involve some data manipulation and analysis functions.

4.5 DATA RETRIEVAL AND QUERYING

The ability to query and retrieve data based on some user defined criteria is a necessary feature of the data storage and retrieval subsystem.

Data retrieval involves the capability to easily select data for graphic or attribute editing, updating, querying, analysis and/or display.

The ability to retrieve data is based on the unique structure of the DBMS and command interfaces are commonly provided with the software. Most GIS software also provides a programming subroutine library, or macro language, so the user can write their own specific data retrieval routines if required.

Querying is the capability to retrieve data, usually a data subset, based on some user defined formula. These data subsets are often referred to as logical views. Often the querying is closely linked to the data manipulation and analysis subsystem. Many GIS software offerings have attempted to standardize their querying capability through the Structured Query Language (SQL). This is especially true with systems that make use of an external relational DBMS. Through the use of SQL, GIS software can interface to a variety of different DBMS packages. This approach provides the user with the flexibility to select their own DBMS. This has direct implications if the organization has an existing DBMS that is being used to satisfy other business requirements. Often it is desirable for the same DBMS to be utilized in the GIS applications. This notion of integrating the GIS software with an existing DBMS through standards is referred to as corporate or enterprise GIS. With the migration of GIS technology from being a research tool to being a decision support tool there is a requirement for it to be totally integrated with existing corporate activities, including accounting, reporting, and business functions.

There is a definite trend in the GIS marketplace towards a generic interface with external relational DBMSs. The use of an external DBMS, linked via an SQL interface, is becoming the norm. Such flexibility is a strong selling point for any GIS. SQL is quickly becoming a standard in the GIS software marketplace.
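
A minimal sketch of such an attribute query, using Python's built-in sqlite3 module as a stand-in for whatever external relational DBMS is in place; the table and column names are invented for illustration.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE forest_cover (feature_id INTEGER, cover_group TEXT, age INTEGER)")
    con.executemany("INSERT INTO forest_cover VALUES (?, ?, ?)",
                    [(1, "spruce", 80), (2, "aspen", 35), (3, "spruce", 120)])

    # A typical attribute query: return the feature ids of old spruce stands.
    rows = con.execute(
        "SELECT feature_id FROM forest_cover WHERE cover_group = ? AND age > ?",
        ("spruce", 100)).fetchall()
    print(rows)   # -> [(3,)]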

CHAPTER 5 – DATA ANALYSIS

5.1 INTRODUCTION

This chapter reviews data manipulation and analysis capabilities within a GIS. The focus is on reviewing spatial data analytical functions. This chapter categorizes analytical functions within a GIS and will be of most interest to technical staff and GIS operators.

Manipulation and Transformation of Spatial Data
Integration and Modeling of Spatial Data
Integrated Analytical Functions

The major difference between GIS software and CAD mapping software is the provision of capabilities for transforming the original spatial data in order to be able to answer particular queries. Some transformation capabilities are common to both GIS and CAD systems; however, GIS software provides a larger range of analysis capabilities that can operate on the topology or spatial aspects of the geographic data, on the non-spatial attributes of these data, or on both.

The main criterion used to define a GIS is its capability to transform and integrate spatial data.

5.2 MANIPULATION AND TRANSFORMATION OF SPATIAL DATA

The maintenance and transformation of spatial data concerns the ability to input, manipulate, and transform data once it has been created. While many different interpretations exist with respect to what constitutes these capabilities, some specific functions can be identified. These are reviewed below.

Coordinate Thinning

Coordinate thinning involves the weeding or reduction of coordinate pairs, e.g. X and Y, from arcs. This function is often required when data has been captured with too many vertices for the linear features. This can result in redundant data and large data volumes. The weeding of coordinates is required to reduce this redundancy.

The thinning of coordinates is also required in the map generalization process of linear simplification. Linear simplification is one component of generalization that is required when data from one scale, e.g. 1:20,000, is to be used and integrated with data from another scale, e.g. 1:100,000. Coordinate thinning is often done on features such as contours, hydrography, and forest stand boundaries.
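
Coordinate weeding is commonly performed with a point-reduction algorithm such as Douglas-Peucker; the sketch below is a minimal, illustrative version that keeps a vertex only when it lies farther from the connecting straight line than a given tolerance (the example line is invented).

    import math

    def perp_distance(pt, start, end):
        # Perpendicular distance from pt to the straight line through start and end.
        (x, y), (x1, y1), (x2, y2) = pt, start, end
        if (x1, y1) == (x2, y2):
            return math.dist(pt, start)
        num = abs((x2 - x1) * (y1 - y) - (x1 - x) * (y2 - y1))
        return num / math.dist(start, end)

    def douglas_peucker(points, tolerance):
        # Keep the end points; keep the farthest intermediate vertex only if
        # it lies farther than the tolerance from the connecting line.
        if len(points) < 3:
            return points
        dists = [perp_distance(p, points[0], points[-1]) for p in points[1:-1]]
        i = max(range(len(dists)), key=dists.__getitem__) + 1
        if dists[i - 1] <= tolerance:
            return [points[0], points[-1]]
        left = douglas_peucker(points[:i + 1], tolerance)
        right = douglas_peucker(points[i:], tolerance)
        return left[:-1] + right

    line = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
    print(douglas_peucker(line, tolerance=1.0))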

Geometric Transformations

This function is concerned with the registering of a data layer to a common coordinate scheme. This usually involves registering selected data layers to a standard data layer already registered. The term rubber sheeting is often used to describe this function. Rubber sheeting involves stretching one data layer to meet another based on predefined control points of known locations. Two other functions may be categorized under geometric transformations. These involve warping a data layer stored in one data model, either raster or vector, to another data layer stored in the opposite data model. For example, often classified satellite imagery may require warping to fit an existing forest inventory layer, or a poor quality vector layer may require warping to match a more accurate raster layer.
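
As a simplified illustration of registering one layer to another from control points (a single global affine fit rather than true piecewise rubber sheeting), the sketch below assumes numpy is available; the control point coordinates are invented.

    import numpy as np

    def fit_affine(src, dst):
        # Solve x' = a*x + b*y + c and y' = d*x + e*y + f by least squares
        # from matched control points in the source and target layers.
        A = np.array([[x, y, 1.0] for x, y in src])
        tx, ty = np.array(dst)[:, 0], np.array(dst)[:, 1]
        (a, b, c), *_ = np.linalg.lstsq(A, tx, rcond=None)
        (d, e, f), *_ = np.linalg.lstsq(A, ty, rcond=None)
        return lambda x, y: (a * x + b * y + c, d * x + e * y + f)

    # Three or more control points define the fit; extras are averaged out.
    src = [(0, 0), (10, 0), (0, 10), (10, 10)]
    dst = [(100, 200), (110, 201), (99, 210), (109, 211)]
    transform = fit_affine(src, dst)
    print(transform(5, 5))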

Map Projection Transformations

This functionality concerns the transformation of data in geographic coordinates from an existing map projection to another map projection. Most GIS software requires that data layers be in the same map projection for analysis. Accordingly, if data is acquired in a different projection than the other data layers it must be transformed. Typically 20 or more different map projections are supported in a GIS software offering.
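
Assuming a projection library such as pyproj is available, a projection transformation might look like the following sketch; the EPSG codes and coordinates are arbitrary examples rather than a recommendation.

    from pyproj import Transformer

    # Geographic coordinates (EPSG:4326, longitude/latitude) to UTM zone 33N (EPSG:32633).
    transformer = Transformer.from_crs("EPSG:4326", "EPSG:32633", always_xy=True)
    easting, northing = transformer.transform(15.0, 52.0)   # lon, lat
    print(easting, northing)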

Conflation – Sliver Removal

Conflation is formally defined as the procedure of reconciling the positions of corresponding features in different data layers. More commonly this is referred to as sliver removal. Often two layers that contain the same feature, e.g. soils and forest stands both with a specific lake, do not have exactly the same boundaries for that feature, e.g. the lake. This may be caused by a lack of coordination or data prioritization during digitizing or by a number of different manipulation and analysis techniques. When the two layers are combined, e.g. normally in polygon overlay, they will not match precisely and small sliver polygons will be created. Conflation is concerned with the process for removing these slivers and reconciling the common boundary.

There are several approaches for sliver removal. Perhaps the most common is allowing the user to define a priority for data layers in combination with a tolerance value. Considering the soils and forest stand example, the user could define a layer that takes precedence, e.g. forest stands, and a size tolerance for slivers. After polygon overlay, if a polygon is below the size tolerance it is classified as a sliver. To reconcile the situation the arcs of the data layer that has higher priority will be retained and the arcs of the other data layer will be deleted. Another approach is to simply divide the sliver down the centre and collapse the arcs making up the boundary. The important point is that all GIS software must have the capability to resolve slivers. Remember that it is generally much less expensive to reconcile maps manually in the map preparation and digitizing stage than afterwards.
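
A minimal sketch of the area-tolerance rule, assuming the shapely library is available and using invented geometry; overlay pieces smaller than the tolerance are flagged as slivers for subsequent reconciliation.

    from shapely.geometry import Polygon

    def split_slivers(overlay_polygons, area_tolerance):
        # overlay_polygons: polygons produced by a polygon overlay.
        real, slivers = [], []
        for poly in overlay_polygons:
            (slivers if poly.area < area_tolerance else real).append(poly)
        return real, slivers

    # Two digitized versions of the same lake shore differ slightly,
    # so the difference between them is a thin sliver.
    lake_soils  = Polygon([(0, 0), (100, 0), (100, 100), (0, 100)])
    lake_forest = Polygon([(1, 0), (100, 0), (100, 100), (1, 100)])
    pieces = [lake_soils.intersection(lake_forest),
              lake_soils.difference(lake_forest)]
    real, slivers = split_slivers(pieces, area_tolerance=500)
    print(len(real), len(slivers))   # -> 1 1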

Edge Matching

Edge matching is simply the procedure to adjust the position of features that extend across typical map sheet boundaries. Theoretically data from adjacent map sheets should meet precisely at map edges. However, in practice this rarely occurs. Misalignment of features can be caused by several factors including digitizing error, paper shrinkage of source maps, and errors in the original mapping. Edge matching always requires some interactive editing. Accordingly, GIS software differs considerably in the degree of automation provided.

Interactive Graphic Editing

Interactive graphic editing functions involve the addition, deletion, moving, and changing of the geographic position of features. Editing should be possible at any time. Most graphic editing occurs during the data compilation phase of any project. Remember, typically 60 to 70 % of the time required to complete any project involves data compilation. Accordingly, the level of sophistication and ease of use of this capability is vitally important and should be rated highly by those evaluating GIS software. Much of the editing that is undertaken involves cleaning up the topological errors identified earlier. The capability to snap to existing elements, e.g. nodes and arcs, is critical.

The functionality of graphic editing does not differ greatly across GIS software offerings. However, the user interface and ease of use of the editing functions usually does. Editing within a GIS software package should be as easy as using a CAD system. A cumbersome or incomplete graphic editing capability will lead to much frustration by the users of the software.

5.3 INTEGRATION AND MODELLING OF SPATIAL DATA

The integration of data provides the ability to ask complex spatial questions that could not be answered otherwise. Often, these are inventory or locational questions such as "how much?" or "where?". Answers to locational and quantitative questions require the combination of several different data layers to be able to provide a more complete and realistic answer. The ability to combine and integrate data is the backbone of GIS.

Often, applications do require a more sophisticated approach to answer complex spatial queries and "what if?" scenarios. The technique used to solve these questions is called spatial modelling. Spatial modelling implies the use of spatial characteristics and methods in manipulating data. Methods exist to create an almost unlimited range of capabilities for data analysis by stringing together sets of primitive analysis functions. While some explicit analytical models do exist, especially in natural resource applications, most modelling formulae (models) are determined based on the needs of a particular project. The capability to undertake complex modelling of spatial data, on an ad hoc basis, has helped to further the resource specialists’ understanding of the natural environment, and the relationship between selected characteristics of that environment.

The use of GIS spatial modelling tools in several traditional resource activities has helped to quantify processes and define models for deriving analysis products. This is particularly true in the area of resource planning and inventory compilation. Most GIS users are able to better organize their applications because of their interaction with, and use of, GIS technology. The utilization of spatial modelling techniques requires a comprehensive understanding of the data sets involved, and the analysis requirements.

The critical function for any GIS is the integration of data.

The raster data model has become the primary spatial data source for analytical modeling with GIS. The raster data model is well suited to the quantitative analysis of numerous data layers. To facilitate these raster modeling techniques most GIS software employs a separate module specifically for cell processing.

The following diagram represents a logic flowchart of a typical natural resource model using GIS raster modeling techniques. The boxes represent raster maps in the GIS, while the connection lines imply an analytical function or technique (from Berry).

5.4 INTEGRATED ANALYTICAL FUNCTIONS IN A GIS

Most GISs provide the capability to build complex models by combining primitive analytical functions. Systems vary as to the complexity provided for spatial modelling, and the specific functions that are available. However, most systems provide a standard set of primitive analytical functions that are accessible to the user in some logical manner. Aronoff identifies four categories of GIS analysis functions. These are:

Retrieval, Reclassification, and Generalization;
Topological Overlay Techniques;
Neighbourhood Operations; and
Connectivity Functions.

The range of analysis techniques in these categories is very large. Accordingly, this section of the book focuses on providing an overview of the fundamental primitive functions that are most often utilized in spatial analyses.

Retrieval, Reclassification and Generalization

Perhaps the initial GIS analysis that any user undertakes is the retrieval and/or reclassification of data. Retrieval operations occur on both spatial and attribute data. Often data is selected by an attribute subset and viewed graphically. Retrieval involves the selective search, manipulation, and output of data without the requirement to modify the geographic location of the features involved.

Reclassification involves the selection and presentation of a selected layer of data based on the classes or values of a specific attribute, e.g. cover group. It involves looking at an attribute, or a series of attributes, for a single data layer and classifying the data layer based on the range of values of the attribute. Accordingly, features adjacent to one another that have a common value, e.g. cover group, but differ in other characteristics, e.g. tree height, species, will be treated and appear as one class. In raster based GIS software, numerical values are often used to indicate classes. Reclassification is an attribute generalization technique. Typically this function makes use of polygon patterning techniques such as crosshatching and/or color shading for graphic representation.

In a vector based GIS, boundaries between polygons of common reclassed values should be dissolved to create a cleaner map of homogeneous continuity. Raster reclassification intrinsically involves boundary dissolving. The dissolving of map boundaries based on a specific attribute value often results in a new data layer being created. This is often done for visual clarity in the creation of derived maps. Almost all GIS software provides the capability to easily dissolve boundaries based on the results of a reclassification. Some systems allow the user to create a new data layer for the reclassification while others simply dissolve the boundaries during data output.

One can see how the querying capability of the DBMS is a necessity in the reclassification process. The ability and process for displaying the results of reclassification, a map or report, will vary depending on the GIS. In some systems the querying process is independent from data display functions, while in others they are integrated and querying is done in a graphics mode. The exact process for undertaking a reclassification varies greatly from GIS to GIS. Some will store results of the query in query sets independent from the DBMS, while others store the results in a newly created attribute column in the DBMS. The approach varies drastically depending on the architecture of the GIS software.
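
In the raster domain, the reclassification described above amounts to a table lookup applied cell by cell; a minimal sketch follows, with invented cover-group codes and class names.

    # Reclassify a small raster of cover-group codes into broader classes.
    # Cell values 1-2 stand for conifer cover groups, 3-4 for deciduous (illustrative codes).
    reclass_table = {1: "conifer", 2: "conifer", 3: "deciduous", 4: "deciduous"}

    cover_raster = [
        [1, 1, 3],
        [2, 3, 4],
    ]

    reclassified = [[reclass_table[cell] for cell in row] for row in cover_raster]
    print(reclassified)
    # Adjacent cells that now share a class form one homogeneous unit,
    # the raster equivalent of dissolving polygon boundaries.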

Topological Overlay

The capability to overlay multiple data layers in a vertical fashion is the most required and common technique in geographic data processing. In fact, the use of a topological data structure can be traced back to the need for overlaying vector data layers. With the advent of the concepts of mathematical topology, polygon overlay has become the most popular geoprocessing tool, and the basis of any functional GIS software package.

Topological overlay is predominantly concerned with overlaying polygon data with polygon data, e.g. soils and forest cover. However, there are requirements for overlaying point, linear, and polygon data in selected combinations, e.g. point in polygon, line in polygon, and polygon on polygon are the most common. Vector and raster based software differ considerably in their approach to topological overlay.

Raster based software is oriented towards arithmetic overlay operations, e.g. the addition, subtraction, division, and multiplication of data layers. The nature of the one-attribute map approach, typical of the raster data model, usually provides a more flexible and efficient overlay capability. The raster data model affords a strong numerical modelling (quantitative analysis) capability. Most sophisticated spatial modelling is undertaken within the raster domain.

In vector based systems topological overlay is achieved by the creation of a new topological network from two or more existing networks. This requires the rebuilding of topological tables, e.g. arc, node, polygon, and therefore can be time consuming and CPU intensive. The result of a topological overlay in the vector domain is a new topological network that will contain attributes of the original input data layers. In this way selected queries can then be undertaken of the original layer, e.g. soils and forest cover, to determine where specific situations occur, e.g. deciduous forest cover where drainage is poor.

Most GIS software makes use of a consistent logic for the overlay of multiple data layers. The rules of Boolean logic are used to operate on the attributes and spatial properties of geographic features. Boolean algebra uses the operators AND, OR, XOR, NOT to see whether a particular condition is true or false. Boolean logic represents all possible combinations of spatial interaction between different features. The implementation of Boolean operators is often transparent to the user.
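
A minimal sketch of a Boolean (AND) overlay of two co-registered binary rasters, with invented layer values, illustrates the cell-by-cell logic.

    # Cell-by-cell Boolean overlay of two co-registered rasters:
    # 1 = deciduous cover and 1 = poor drainage (illustrative binary layers).
    deciduous = [
        [1, 0, 1],
        [0, 1, 1],
    ]
    poor_drainage = [
        [1, 1, 0],
        [0, 1, 1],
    ]

    # AND keeps only the cells where both conditions hold.
    result = [[a and b for a, b in zip(r1, r2)]
              for r1, r2 in zip(deciduous, poor_drainage)]
    print(result)   # -> [[1, 0, 0], [0, 1, 1]]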

To date the primary analysis technique used in GIS applications, vector and raster, is the topological overlay of selected data layers.

Generally, GIS software implements the overlay of different vector data layers by combining the spatial and attribute data files of the layers to create a new data layer. Again, different GIS software utilize varying approaches for the display and reporting of overlay results. Some systems require that topological overlay occur on only two data layers at a time, creating a third layer.

This pairwise approach requires the nesting of multiple overlays to generate a final overlay product, if more than two data layers are involved. This can result in numerous intermediate or temporary data layers. Some systems create a complete topological structure at the data verification stage, and the user merely submits a query string for the combined topological data. Other systems allow the user to overlay multiple data layers at one time. Each approach has its drawbacks depending on the application and the nature of the implementation. Determining the most appropriate method is based on the type of application, practical considerations such as data volumes and CPU power, and other considerations such as personnel and time requirements. Overall, the flexibility provided to the operator and the level of performance varies widely among GIS software offerings.

The following diagram illustrates a typical overlay requirement where several different layers are spatially joined to create a new topological layer. By combining multiple layers in a topological fashion complex queries can be answered concerning attributes of any layer.

5.5 NEIGHBOURHOOD OPERATIONS

Neighbourhood operations evaluate the characteristics of an area surrounding a specific location. Virtually all GIS software provides some form of neighbourhood analysis. A range of different neighbourhood functions exist. The analysis of topographic features, e.g. the relief of the landscape, is normally categorized as being a neighbourhood operation. This involves a variety of point interpolation techniques including slope and aspect calculations, contour generation, and Thiessen polygons. Interpolation is defined as the method of predicting unknown values using known values at neighbouring locations. Interpolation is utilized most often with point-based elevation data.
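
Inverse distance weighting (IDW) is one common point-interpolation technique; a minimal sketch with invented sample points follows.

    import math

    def idw(x, y, samples, power=2):
        # samples: list of (x, y, value) elevation observations.
        # The predicted value is a distance-weighted average of the samples.
        num = den = 0.0
        for sx, sy, sv in samples:
            d = math.dist((x, y), (sx, sy))
            if d == 0:
                return sv               # exactly on a sample point
            w = 1.0 / d ** power
            num += w * sv
            den += w
        return num / den

    points = [(0, 0, 100.0), (10, 0, 120.0), (0, 10, 110.0), (10, 10, 130.0)]
    print(idw(5, 5, points))   # a value between the surrounding samples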

This example illustrates a continuous surface that has been created by interpolating sample data points.

Elevation data usually takes the form of irregularly or regularly spaced points. Irregularly spaced points are stored in a Triangulated Irregular Network (TIN). A TIN is a vector topological network of triangular facets generated by joining the irregular points with straight line segments. The TIN structure is utilized when irregular data is available, predominantly in vector based systems. TIN is a vector data model for 3-D data.

An alternative for storing elevation data is the regular-point Digital Elevation Model (DEM). The term DEM usually refers to a grid of regularly spaced elevation points. These points are usually stored with a raster data model. Most GIS software offerings provide three dimensional analysis capabilities in a separate module of the software. Again, they vary considerably with respect to their functionality and the level of integration between the 3-D module and the other more typical analysis functions.

Without doubt the most common neighbourhood function is buffering. Buffering involves the ability to create distance buffers around selected features, be they points, lines, or areas. Buffers are created as polygons because they represent an area around a feature. Buffering is also referred to as corridor or zone generation with the raster data model. Usually, the results of a buffering process are utilized in a topological overlay with another data layer. For example, to determine the volume of timber within a selected distance of a cutline, the user would first buffer the cutline data layer. They would then overlay the resultant buffer data layer, a buffer polygon, with the forest cover data layer in a clipping fashion. This would result in a new data layer that only contained the forest cover within the buffer zone. Since all attributes are maintained in the topological overlay and buffering processes, a map or report could then be generated.

Buffering is typically used with point or linear features. The generation of buffers for selected features is frequently based on a distance from that feature, or on a specific attribute of that feature. For example, some features may have a greater zone of influence due to specific characteristics, e.g. a primary highway would generally have a greater influence than a gravel road. Accordingly, different size buffers can be generated for features within a data layer based on selected attribute values or feature types.
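
Assuming the shapely library is available, the cutline example above can be sketched as a buffer followed by a clipping intersection; the geometry and distances are invented.

    from shapely.geometry import LineString, Polygon

    cutline = LineString([(0, 50), (100, 50)])
    stand = Polygon([(20, 20), (80, 20), (80, 80), (20, 80)])   # one forest stand

    buffer_zone = cutline.buffer(10)                 # 10-unit corridor around the cutline
    within_zone = stand.intersection(buffer_zone)    # clip the stand to the corridor

    # The area inside the corridor could feed a timber-volume report for this stand.
    print(round(within_zone.area, 1))   # -> 1200.0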

Connectivity Analysis

The distinguishing feature of connectivity operations is that they use functions that accumulate values over an area being traversed. Most often these include the analysis of surfaces and networks. Connectivity functions include proximity analysis, network analysis, spread functions, and three dimensional surface analysis such as visibility and perspective viewing. This category of analysis techniques is the least developed in commercial GIS software. Consequently, there is often a great difference in the functionality offered between GIS software offerings. Raster based systems often provide the more sophisticated surface analysis capabilities while vector based systems tend to focus on linear network analysis capabilities. However, this appears to be changing as GIS software becomes more sophisticated, and multi-disciplinary applications require a more comprehensive and integrated functionality. Some GIS offerings provide both vector and raster analysis capabilities. Only in these systems will one find a full range of connectivity analysis techniques.

Proximity analysis techniques are primarily concerned with the proximity of one feature to another. Usually proximity is defined as the ability to identify any feature that is near any other feature based on location, attribute value, or a specific distance. A simple example is identifying all the forest stands that are within 100 metres of a gravel road, but not necessarily adjacent to it. It is important to note that neighbourhood buffering is often categorized as being a proximity analysis capability. Depending on the particular GIS software package, the data model employed, and the operational architecture of the software it may be difficult to distinguish proximity analysis and buffering.

Proximity analysis is often used in urban based applications to consider areas of influence, and ownership queries. Proximity to roads and engineering infrastructure is typically important for development planning, tax calculations, and utility billing.

The identification of adjacency is another proximity analysis function. Adjacency is defined as the ability to identify any feature having certain attributes that exhibits adjacency with other selected features having certain attributes. A typical example is the ability to identify all forest stands of a specific type, e.g. species, adjacent to a gravel road.

Network analysis is a widely used analysis technique. Network analysis techniques can be characterized by their use of feature networks. Feature networks are almost entirely comprised of linear features. Hydrographic hierarchies and transportation networks are prime examples. Two example network analysis techniques are the allocation of values to selected features within the network to determine capacity zones, and the determination of shortest path between connected points or nodes within the network based on attribute values. This is often referred to as route optimization. Attribute values may be as simple as minimal distance, or more complex involving a model using several attributes defining rate of flow, impedance, and cost.
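
A minimal shortest-path sketch (Dijkstra's algorithm over an invented road network, with edge weights standing in for distance or impedance) illustrates the route optimization idea.

    import heapq

    def shortest_path_cost(network, start, goal):
        # network: {node: [(neighbour, cost), ...]} where cost can encode
        # distance, travel time, or a combined impedance value.
        best = {start: 0.0}
        queue = [(0.0, start)]
        while queue:
            cost, node = heapq.heappop(queue)
            if node == goal:
                return cost
            if cost > best.get(node, float("inf")):
                continue
            for nxt, weight in network.get(node, []):
                new_cost = cost + weight
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    heapq.heappush(queue, (new_cost, nxt))
        return None   # goal not reachable

    roads = {"A": [("B", 4), ("C", 2)],
             "B": [("D", 5)],
             "C": [("B", 1), ("D", 8)],
             "D": []}
    print(shortest_path_cost(roads, "A", "D"))   # -> 8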

Three dimensional analysis involves a range of different capabilities. The most utilized is the generation of perspective surfaces. Perspective surfaces are usually represented by a wire frame diagram reflecting profiles of the landscape, e.g. every 100 metres. These profiles viewed together, with the removal of hidden lines, provide a three dimensional view. As previously identified, most GIS software packages offer 3-D capabilities in a separate module. Several other functions are normally available.

These include the following functions:

user-definable vertical exaggeration, viewing azimuth, and elevation angle;
identification of viewsheds, e.g. seen versus unseen areas;
the draping of features, e.g. point, lines, and shaded polygons onto the perspective surface;
generation of shaded relief models simulating illumination;
generation of cross section profiles;
presentation of symbology on the 3-D surface; and
line of sight perspective views from user defined viewpoints.

While the primitive analytical functions have been presented, the reader should be aware that a wide range of more specific and detailed capabilities does exist.

The overriding theme of all GIS software is that the analytical functions are totally integrated with the DBMS component. This integration provides the necessary foundation for all analysis techniques.