Glottography Data Tutorials

From a language map image to a Glottography dataset in six tutorials


Digitising language polygons

Digitising is the process of tracing features from a georeferenced map and converting them into digital points, lines, or, in our case, polygons that a GIS can interpret. In this tutorial, we digitise the language areas shown on the Alor-Pantar map by Schapper (2020), which we georeferenced in the Georeferencing tutorial. This tutorial focuses on copying the geometry of the language polygons from the map. For information on how to copy attributes (including Glottocodes) and metadata, see the Attributes and Metadata and the Glottocodes tutorials.

We will explore two different approaches to digitising language polygons from a language map.

A: Digitising language polygons from scratch

This method involves manually drawing polygons one by one. It is quick and straightforward but can introduce geometric and topological inconsistencies, particularly where the language map does not align well with coastlines or landforms. This issue is more likely to occur with maps that have coarse spatial resolution or inaccurate georeferencing. Conversely, digitising from scratch is best suited to inland regions, where coastal accuracy is less important, or to high-resolution maps that have been accurately georeferenced.

B: Splitting language polygons from existing landforms

This method involves splitting language areas from an existing polygon dataset of continents and landforms. The resulting language polygons follow coastlines and land boundaries derived from high-resolution satellite imagery, which are often more accurate than those shown on a scanned or georeferenced map. However, a key limitation is that the language polygons are constrained by the boundaries of the existing dataset, which may omit small or irregular geographic features, such as islands or narrow coastal strips.

Finally, we merge the digitised individual polygons into Multipolygons grouped by language name. Some languages are represented by several disjoint polygons, which we combine into a single Multipolygon geometry for easier handling.

Requirements

Software: QGIS, a free and open-source geographic information system (GIS). This tutorial uses QGIS version 3.34.4-Prizren.

Data: A georeferenced map in GeoTIFF format. In this tutorial, we digitise the Alor-Pantar languages map from the Georeferencing tutorial, whose GeoTIFF can be downloaded here. For part B, we also use Land polygons, including major islands, from the 1:10m Physical Vectors dataset by Natural Earth. These data are provided as Shapefiles, a widely used legacy format for storing geographic vector data.

Digitising language polygons from scratch

Before digitising, we need to load the georeferenced raster map of the Alor-Pantar languages that we created in the Georeferencing tutorial. Go to Layer > Add Layer > Add Raster Layer… and locate the file, or drag and drop the file into the Layers panel.

We also open the Carto Basic basemap from the HCMGIS plugin as a spatial reference: HCMGIS > Basemaps > Vector tiles > Carto Basic.

Next, we create an empty polygon vector layer to store the digitised language areas. To do this, go to Layer > Create Layer > New GeoPackage Layer… to initialise a new GeoPackage file. GeoPackage (.gpkg) is a file format for storing geographic features and has become the de facto standard in QGIS. The file will act as a container for the language polygons we are about to digitise. In the Data curation tutorial, we will later convert the GeoPackage to a GeoJSON file, a lightweight, human-readable format for representing geographic features. While you could digitise the polygons directly in GeoJSON format, this tutorial uses the GeoPackage format because it offers greater flexibility with projections. In GeoJSON, for example, coordinates must be expressed in longitude and latitude using decimal degrees, corresponding to the EPSG:4326 coordinate reference system (CRS). This restriction does not apply to GeoPackages.
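
To illustrate the GeoJSON coordinate convention mentioned above, here is a minimal sketch in plain Python (standard library only, no QGIS required). The feature, its name, and its coordinates are made up for illustration and are not taken from the Alor-Pantar map; the helper `is_valid_lon_lat` is hypothetical.

```python
import json

def is_valid_lon_lat(position):
    """Return True if a [lon, lat] pair lies within the valid degree ranges."""
    lon, lat = position[0], position[1]
    return -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0

# Hypothetical polygon with made-up coordinates: longitude first, then latitude.
feature = {
    "type": "Feature",
    "properties": {"name": "Example language"},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[
            [124.0, -8.5], [124.5, -8.5], [124.5, -8.0],
            [124.0, -8.0], [124.0, -8.5],
        ]],
    },
}

# Check every vertex of the exterior ring, then serialise the feature as text.
outer_ring = feature["geometry"]["coordinates"][0]
all_in_range = all(is_valid_lon_lat(p) for p in outer_ring)
geojson_text = json.dumps(feature, indent=2)
print(all_in_range)
```

Coordinates from a projected CRS (metres rather than degrees) would typically fail this range check, which is one reason to digitise in a GeoPackage and convert later.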

Creating a new GeoPackage layer.

A dialog appears, prompting you to define the properties of the GeoPackage, including the geometry type (point, line, or polygon), the coordinate reference system (CRS), and the (non-spatial) attributes.

The GeoPackage dialog.

In the dialog box, click the ... button next to Database to choose a location and file name for the output GeoPackage file. Under Layer name, enter a name for the polygon layer. This name will appear in the QGIS Layers panel. Set the Geometry type to Polygon. Choose a Coordinate Reference System (CRS) for the layer. Here, we use the standard EPSG:4326 - WGS 84, though other CRS may be more appropriate depending on the region you want to digitise. Next, define the attribute fields to store information about each polygon. In the New Field section, add the following fields:

Field Name | Type
name | Text (string)
map_name_full | Text (string)
year | Text (string)
glottocode | Text (string)
note | Text (string)

For a detailed explanation of all attributes needed for Glottography polygons, see the Attributes and Metadata tutorial. In the Advanced section, change the name of the Feature id column to id. Once all fields are defined, click OK to create the GeoPackage layer.
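
The field list above can also be checked programmatically once features exist. The following is a minimal sketch in plain Python, assuming each feature's attributes are available as a dictionary; `check_properties` is a hypothetical helper, and the Glottocode `abcd1234` and the map title are placeholders, not real values.

```python
# Field names as listed in the table above; all are stored as text.
REQUIRED_FIELDS = ("name", "map_name_full", "year", "glottocode", "note")

def check_properties(properties):
    """Return a list of problems: missing fields or values that are not text."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in properties:
            problems.append(f"{field}: missing")
        elif properties[field] is not None and not isinstance(properties[field], str):
            # None stands in for an empty (NULL) cell and is tolerated here.
            problems.append(
                f"{field}: expected text, got {type(properties[field]).__name__}"
            )
    return problems

# Placeholder attribute values; the year is deliberately stored as a number.
example = {
    "name": "Wersing",
    "map_name_full": "Example map title",
    "year": 2020,
    "glottocode": "abcd1234",
}
problems = check_properties(example)
print(problems)
```

Here the check flags the numeric year and the absent note field, the two deviations from the table.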

Start Digitising

Ensure that your new GeoPackage layer is selected in the Layers panel. Enable Edit Mode by clicking the pencil icon in the toolbar, or right-click the layer and select Toggle Editing.

Toggle Editing.

The icons in the Digitising Toolbar should now become active. Click the Add Polygon Feature tool.

Activate the Add Polygon Feature.

To create a language polygon, click on the map to trace the language area, placing one vertex at a time. Right-click to finish and close the polygon.
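
Conceptually, finishing a polygon turns the list of clicked vertices into a closed ring. The sketch below shows this in plain Python, following the common convention (used by GeoJSON, for instance) that a ring repeats its first vertex at the end. How QGIS handles this internally is not covered by this tutorial; `close_ring` is a hypothetical helper and the coordinates are made up.

```python
def close_ring(vertices):
    """Turn a list of clicked (x, y) vertices into a closed polygon ring."""
    if len(vertices) < 3:
        raise ValueError("a polygon needs at least three vertices")
    ring = list(vertices)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # repeat the first vertex to close the ring
    return ring

# Four made-up clicks tracing a rectangle.
clicked = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
ring = close_ring(clicked)
print(ring)
```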

Digitising a language area.

Right-clicking will also open a dialog box where you can enter attribute values for the feature you just created. For example, here we enter information for the Wersing language, including its name, Glottocode, map name and year. Notice how the ID is automatically generated (Autogenerate), ensuring no duplicates.

Adding attributes.

To save your edits, click the Save Layer Edits button (disk icon), or toggle editing off again.

Save layer edits.

Don’t forget to save your QGIS project regularly by going to Project > Save As… to ensure that all your settings, layers, and views are preserved for future use. Now all you have to do is repeat the process to trace the remaining language areas on the map. The final digitised map should look something like this.

The digitised language map.

When you have finished digitising, head to the section Merging polygons by language name to combine the digitised individual polygons into Multipolygons grouped by language name.

Snapping

To improve the accuracy of your digitising when adding polygons, you can enable snapping so that new features align precisely with existing ones. First, make sure the Snapping Toolbar is visible. Go to View > Toolbars > Snapping Toolbar. In the Snapping Toolbar, click the magnet icon to enable snapping.

Snapping Toolbar with Snapping enabled.

Set the snapping distance. Typically 10 pixels is a good starting point. Next, check the option Avoid Overlap on Active Layer. This ensures that any new polygons you draw will not overlap with existing polygons in the same layer.

Activate Avoid Overlap on Active Layer.

Digitising now works as before, but with one key difference: QGIS will automatically snap new vertices to nearby existing ones that lie within the snapping distance, helping you maintain clean, topologically correct boundaries.
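
The idea behind vertex snapping can be sketched in a few lines of plain Python. This is only an illustration of the principle, not how QGIS implements it: the tolerance here is expressed in the same units as the coordinates rather than in screen pixels, and `snap_vertex` is a hypothetical helper.

```python
import math

def snap_vertex(point, existing_vertices, tolerance):
    """Return the nearest existing vertex within `tolerance`, else the point itself."""
    best, best_dist = None, tolerance
    for vertex in existing_vertices:
        dist = math.dist(point, vertex)
        if dist <= best_dist:
            best, best_dist = vertex, dist
    return best if best is not None else point

# Made-up vertices of an already digitised boundary.
existing = [(0.0, 0.0), (10.0, 0.0)]

snapped = snap_vertex((9.6, 0.3), existing, tolerance=1.0)    # close to (10, 0)
unchanged = snap_vertex((5.0, 5.0), existing, tolerance=1.0)  # nothing nearby
print(snapped, unchanged)
```

Because the new vertex takes exactly the coordinates of the existing one, neighbouring polygons share the boundary without gaps or slivers.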

Cutting enclave language polygons

In some cases, a language area may be completely surrounded by another. An example is Pennsylvania Dutch in the USA, entirely enclosed by English-speaking regions. For such a language enclave, we first cut a hole in the surrounding language area and then fill that hole with a new polygon representing the enclave language. This approach is suited for very specific cases and may not be commonly needed. However, in situations where an isolated language enclave exists, it is often the only viable method. While there is no such enclave on the Alor-Pantar language map, we will briefly walk through the steps required to create one.

Click the Toggle Editing icon to start editing. Ensure the Advanced Digitising Toolbar is active. If not, go to View > Toolbars > Advanced Digitising Toolbar. In the Advanced Digitising Toolbar, click the Fill Ring icon.

Activate the Fill Ring tool.

The Fill Ring tool is now active. This tool cuts a ring into a polygon and fills it with a new polygon. Trace the enclave as you would any other language polygon. Right-click to finish, and enter the attribute information for the enclave language. Note: We’re demonstrating this here solely to showcase the method—there are no enclave languages in the Alor-Pantar map.
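
What the tool does to the geometry can be sketched in plain Python: the traced ring becomes a hole in the surrounding polygon and, at the same time, the exterior of a new polygon. The sketch assumes the ring lies entirely inside the surrounding polygon and ignores ring winding order; `fill_ring`, `ring_area`, and `polygon_area` are hypothetical helpers and the coordinates are made up.

```python
def ring_area(ring):
    """Absolute area of a closed ring, computed with the shoelace formula."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def polygon_area(polygon):
    """Area of a polygon given as [exterior_ring, hole_1, hole_2, ...]."""
    exterior, holes = polygon[0], polygon[1:]
    return ring_area(exterior) - sum(ring_area(hole) for hole in holes)

def fill_ring(polygon, new_ring):
    """Cut `new_ring` as a hole into `polygon` and return (cut_polygon, enclave)."""
    cut_polygon = polygon + [new_ring]  # exterior ring first, then holes
    enclave = [new_ring]                # new polygon that exactly fills the hole
    return cut_polygon, enclave

# Made-up geometry: a 10 x 10 surrounding area and a 2 x 2 enclave inside it.
surrounding = [[(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]]
enclave_ring = [(4, 4), (6, 4), (6, 6), (4, 6), (4, 4)]

surrounding_cut, enclave = fill_ring(surrounding, enclave_ring)
print(polygon_area(surrounding_cut), polygon_area(enclave))
```

The two resulting areas add up to the area of the original polygon, so the enclave neither overlaps its surroundings nor leaves a gap.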

Cut out and fill the ring.

You can now select the filled ring to verify that the tool created a new polygon.

We cut a ring into a language polygon and filled it. While this may not make much sense on this map, it helps us create polygons for maps with language enclaves.

Splitting language polygons from existing landforms

This method splits language areas from an existing polygon dataset of continents and landforms. We use the Land polygons including major islands shapefile from the 1:10m Physical Vectors by Natural Earth. To load the shapefile in QGIS, go to Layer > Add Layer > Add Vector Layer…, browse to the file location of the Natural Earth land polygon shapefile, and click Add. Alternatively, you can simply drag and drop the shapefile into the Layers panel in QGIS.

The Natural Earth Land Polygons.

We can already see that the Natural Earth land polygons are not detailed enough for this region — they are missing the Pura and Treweng islands. While we likely wouldn’t use this dataset for digitising this language map, we will carry on for the sake of demonstrating how to cut language polygons.

Preparing the Base Layer for Digitising

The Natural Earth land polygons will serve as the base from which we cut out the digitised language areas. We need to prepare this vector layer first. The Shapefile is a clunky legacy format that stores geometry and attribute data across several separate files. To streamline our workflow, we convert the layer to a GeoPackage. Right-click the layer in the Layers panel and go to Export > Save Features As….

Export the Natural Earth Land Polygons.

A dialog appears. Set the Format to GeoPackage, and specify the File name and location for the output file. You can leave the other settings as they are.

Save the Natural Earth Land Polygons as GeoPackage.

Multipolygons to single parts

The Natural Earth polygons are stored as large Multipolygons, each containing potentially hundreds of single polygons. Before digitising, we need to separate these into their individual components using the Multipart to Singleparts tool. Go to Processing Toolbox > Vector geometry > Multipart to Singleparts….

Open the Multipart to Singleparts tool.

In the dialog, set the Input layer to the Natural Earth land polygons and define the file name and location for the single parts. Again, use the GeoPackage format.

The Multipart to Singleparts tool.

Click Run. This will generate a new layer where each polygon becomes a separate feature.
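
For readers who prefer to see the operation as data, the following plain-Python sketch does the equivalent on GeoJSON-style dictionaries: every part of a MultiPolygon becomes its own Polygon feature with a copy of the attributes. It illustrates the concept only and is not the QGIS implementation; `multipart_to_singleparts` is a hypothetical helper and the geometry and attribute values are made up.

```python
def multipart_to_singleparts(features):
    """Explode MultiPolygon features into one Polygon feature per part."""
    single = []
    for feature in features:
        geometry = feature["geometry"]
        if geometry["type"] == "MultiPolygon":
            parts = geometry["coordinates"]
        elif geometry["type"] == "Polygon":
            parts = [geometry["coordinates"]]
        else:
            raise ValueError(f"unsupported geometry type: {geometry['type']}")
        for part in parts:
            single.append({
                "type": "Feature",
                "properties": dict(feature["properties"]),  # copy attributes
                "geometry": {"type": "Polygon", "coordinates": part},
            })
    return single

# One made-up multipart feature consisting of two small squares.
land = [{
    "type": "Feature",
    "properties": {"label": "example land"},
    "geometry": {
        "type": "MultiPolygon",
        "coordinates": [
            [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
            [[[5, 5], [6, 5], [6, 6], [5, 6], [5, 5]]],
        ],
    },
}]

single_parts = multipart_to_singleparts(land)
print(len(single_parts))
```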

Cropping to the Language Map Region

Next, we crop the single-part land polygons to the region covered by our language map. Click the Select Features by Polygon icon in the toolbar.

Activate Select Features by Polygon.

Draw a polygon around the area of interest to select all overlapping land polygons.

Select the land polygons overlapping the language map.

Export the selected features as a new layer. Right-click the layer and go to Export > Save Selected Features As…. This will isolate only the polygons within your region of interest.

Export the selected land polygons.

In the dialog, save the selected features as a GeoPackage. Use the name of the language map as the file name and rename the fid column to id to match Glottography requirements.

Save the selected land polygons as GeoPackage.

We have now cropped the Natural Earth polygons to only those overlapping with the language map.
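
The selection step can be approximated in plain Python with a bounding-box test. This is a coarser check than the one QGIS performs, which works on the drawn selection polygon itself, so treat it as a rough sketch. All names (`ring_bbox`, `bboxes_intersect`, `select_by_region`) and all coordinates are made up for illustration.

```python
def ring_bbox(ring):
    """Bounding box (min_x, min_y, max_x, max_y) of a ring of (x, y) vertices."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)

def bboxes_intersect(a, b):
    """True if two bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def select_by_region(polygons, region_bbox):
    """Keep polygons whose exterior-ring bounding box intersects the region."""
    return {
        name: polygon
        for name, polygon in polygons.items()
        if bboxes_intersect(ring_bbox(polygon[0]), region_bbox)
    }

# Two made-up land polygons: one inside the region of interest, one far away.
land_polygons = {
    "near": [[(124.0, -8.5), (124.5, -8.5), (124.5, -8.0), (124.0, -8.0), (124.0, -8.5)]],
    "far": [[(10.0, 50.0), (11.0, 50.0), (11.0, 51.0), (10.0, 51.0), (10.0, 50.0)]],
}
region = (123.5, -9.0, 125.5, -7.5)  # hypothetical extent of the language map

selected = select_by_region(land_polygons, region)
print(list(selected))
```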

Editing the Attribute Table

Next, we prepare the attribute table by removing irrelevant fields and adding those required by Glottography. Right-click the cropped layer and select Open Attribute Table.

Open the Attribute Table of the cropped land polygons.

Click the Toggle Editing icon to enable edits.

Start editing the attributes.

Click the Delete Field icon to remove irrelevant attributes.

Delete fields.

In the dialog, mark all fields for deletion except for id, and click OK.

The delete fields dialog.

With irrelevant fields removed, begin adding the required attributes by clicking the New Field icon.

Adding new attribute fields.

Add a new field called glottocode as a Text (string) with 8 characters, since Glottocodes always have exactly 8 characters.
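
A simple format check can catch typing mistakes in this field. The sketch below assumes that a Glottocode consists of four lowercase alphanumeric characters followed by four digits; this pattern is an assumption that should be verified against the Glottolog documentation. The check only looks at the format and cannot tell whether a code actually exists. `looks_like_glottocode` is a hypothetical helper and `abcd1234` is a placeholder, not a real code.

```python
import re

# Assumed layout: four lowercase letters or digits, then four digits (8 characters).
GLOTTOCODE_PATTERN = re.compile(r"[a-z0-9]{4}[0-9]{4}")

def looks_like_glottocode(value):
    """Return True if `value` is text that matches the assumed Glottocode layout."""
    return isinstance(value, str) and GLOTTOCODE_PATTERN.fullmatch(value) is not None

print(looks_like_glottocode("abcd1234"))  # placeholder with the expected layout
print(looks_like_glottocode("abcd123"))   # too short
```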

Adding a new field for glottocodes.

Repeat the process to add the remaining fields name, map_name_full, year, and note, all as Text (string). We do not specify a length here, as we usually do not know the maximum length in advance, for example that of a language name. Once done, save your edits.

Saving the added attributes.

Your layer is now ready for splitting off language areas.
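
The attribute edits in this section amount to a small transformation of each feature's attribute record: keep id, drop everything else, and add the empty Glottography fields. A minimal plain-Python sketch follows, using the field names from the table in the first part of this tutorial. `prepare_attributes` is a hypothetical helper, and the input record only imitates what a land polygon's attributes might look like.

```python
# Fields to add, named as in the table in the first part of this tutorial.
GLOTTOGRAPHY_FIELDS = ("glottocode", "name", "map_name_full", "year", "note")

def prepare_attributes(properties):
    """Keep only `id` and add the Glottography fields with empty (None) values."""
    prepared = {"id": properties["id"]}
    for field in GLOTTOGRAPHY_FIELDS:
        prepared[field] = None
    return prepared

# Illustrative input record; the non-id fields stand in for irrelevant attributes.
land_record = {"id": 7, "featurecla": "Land", "scalerank": 0}

prepared = prepare_attributes(land_record)
print(prepared)
```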

Splitting Language Polygons

To better see the map beneath, adjust the visual appearance of the cropped layer. Right-click the layer and select Properties.

Opening the properties.

Navigate to the Symbology tab and change the style to a polygon with a visible outline and no fill.

Change the layer symbology.

Now we begin digitising. Click the Toggle Editing icon to start editing.

Start editing.

Ensure the Advanced Digitising Toolbar is active. If not, go to View > Toolbars > Advanced Digitising Toolbar. Then click the Split Features tool and begin tracing the boundary of the language polygon.

Activate the Split Features tool.

To trace the Kiraman language area, start in the ocean southwest of the landmass, cut across the polygon boundary, trace the language area northeastwards, then east and south, cutting back across the polygon boundary. Finish the shape with a right-click. Tip: Disable snapping by clicking the magnet icon. Snapping is helpful in other digitising tasks but can hinder working with the Split Features tool.
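
The geometric idea of splitting, where a line enters the polygon, crosses it, and leaves again, can be sketched for a much simpler case than the one above. The plain-Python example below splits a convex polygon with a single straight line using half-plane clipping (Sutherland-Hodgman style). The QGIS tool handles traced multi-vertex lines and arbitrary polygon shapes, which this sketch does not attempt; all function names and coordinates are hypothetical.

```python
def side(p, a, b):
    """Positive if p lies left of the directed line a->b, negative if right."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def crossing(p, q, a, b):
    """Point where segment p-q crosses the infinite line through a and b."""
    sp, sq = side(p, a, b), side(q, a, b)
    t = sp / (sp - sq)
    return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

def clip_half_plane(vertices, a, b, keep_left):
    """Clip an open list of convex-polygon vertices to one side of line a-b."""
    sign = 1.0 if keep_left else -1.0
    clipped = []
    for i, p in enumerate(vertices):
        q = vertices[(i + 1) % len(vertices)]
        sp, sq = sign * side(p, a, b), sign * side(q, a, b)
        if sp >= 0:
            clipped.append(p)                  # p is on the side we keep
        if (sp > 0 > sq) or (sp < 0 < sq):
            clipped.append(crossing(p, q, a, b))  # edge crosses the split line
    return clipped

def split_convex_polygon(vertices, a, b):
    """Return the parts of the polygon left and right of the line a->b."""
    return (clip_half_plane(vertices, a, b, keep_left=True),
            clip_half_plane(vertices, a, b, keep_left=False))

def area(vertices):
    """Shoelace area of an open vertex list."""
    pairs = zip(vertices, vertices[1:] + vertices[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)) / 2.0

# Made-up 4 x 2 "island" split by a vertical line at x = 1 that starts and ends outside it.
island = [(0.0, 0.0), (4.0, 0.0), (4.0, 2.0), (0.0, 2.0)]
west, east = split_convex_polygon(island, (1.0, -1.0), (1.0, 3.0))
print(area(west), area(east))
```

As in the QGIS workflow, the split line starts and ends outside the polygon, and the areas of the two parts add up to the area of the original.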

Splitting the Kiraman language area off the Natural Earth polygon of Alor island.

If you receive the error message: “No features were split: If there are selected features, the split tool only applies to those…”, this usually means the wrong feature was selected. Click Deselect Features from All Layers to fix the issue.

Adding Attribute Information

After tracing a language polygon, fill in the attribute fields. Use the Identify Features tool and ensure only the new polygon is highlighted.

Activate the Identify Feature tool.

In the Identify Results panel, click Edit Feature Form.

Activate the Edit Feature form.

A form will open where you can enter the relevant attribute data for the Kiraman language area. Note that the id (2) was autogenerated by QGIS.

Fill in the feature attributes for the Kiraman language area.

Click OK when done. Repeat the process until all language polygons are split from the Natural Earth land polygons.

Merging polygons by language name

Some languages are represented by several disjoint polygons. For example, the Wersing language occurs in multiple polygons in the northeast, east, and southeast of Alor Island. We merge all digitised individual polygons into a single geometry — a Multipolygon — based on shared language name. Note, however, that this approach assumes the name uniquely identifies a language. If different languages share the same name, the merge into Multipolygons must instead be based on another identifier or attribute.

Click Vector > Geometry Tools > Collect Geometries…. The Collect Geometries dialog will open.

Open the Collect Geometries dialog.

In the Collect Geometries dialog, define the Input Layer and set the name column as the Unique ID fields. This will merge all polygons based on the shared name. This approach assumes that the name uniquely identifies a language, which is the case here. In Collected, define the output file in GeoPackage format and specify the layer name. Click Run. QGIS will then merge the polygons into Multipolygon geometries based on shared name.
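
The grouping logic can be sketched in plain Python on GeoJSON-style dictionaries: all polygons that share a name end up as parts of one MultiPolygon. This illustrates the concept rather than the QGIS implementation. In particular, keeping the attributes of the first feature in each group is a choice made for this sketch, `collect_by_name` is a hypothetical helper, and the squares are made-up geometry.

```python
def collect_by_name(features, key="name"):
    """Merge Polygon features that share properties[key] into MultiPolygons."""
    grouped = {}
    for feature in features:
        value = feature["properties"][key]
        geometry = feature["geometry"]
        if geometry["type"] == "MultiPolygon":
            parts = geometry["coordinates"]
        else:
            parts = [geometry["coordinates"]]
        if value not in grouped:
            grouped[value] = {
                "type": "Feature",
                # Sketch-level choice: attributes of the first feature are kept.
                "properties": dict(feature["properties"]),
                "geometry": {"type": "MultiPolygon", "coordinates": []},
            }
        grouped[value]["geometry"]["coordinates"].extend(parts)
    return list(grouped.values())

def square(x, y):
    """Made-up unit square polygon with its lower-left corner at (x, y)."""
    return [[[x, y], [x + 1, y], [x + 1, y + 1], [x, y + 1], [x, y]]]

digitised = [
    {"type": "Feature", "properties": {"id": 1, "name": "Wersing"},
     "geometry": {"type": "Polygon", "coordinates": square(0, 0)}},
    {"type": "Feature", "properties": {"id": 2, "name": "Kiraman"},
     "geometry": {"type": "Polygon", "coordinates": square(3, 0)}},
    {"type": "Feature", "properties": {"id": 3, "name": "Wersing"},
     "geometry": {"type": "Polygon", "coordinates": square(6, 0)}},
]

merged = collect_by_name(digitised)
print([(f["properties"]["name"], len(f["geometry"]["coordinates"])) for f in merged])
```

Note that the id of the second Wersing polygon no longer appears in the output, since several input features collapse into one.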

The Collect Geometries dialog.

We can verify that Wersing, for example, is represented by a single geometry. Note that since we used the name rather than the id column during merging, polygons with distinct IDs were combined into a single Multipolygon and some IDs were lost. Also note that QGIS routinely adds an additional fid column for its own internal identification. For our purposes, we can simply ignore this column.

The individual polygons of the Wersing language merged into a single Multipolygon.

Output

A GeoPackage file containing language polygons and attributes (see the Attributes and Metadata and Glottocodes tutorials). The Alor-Pantar language polygons, digitised in this tutorial, can be downloaded here.