How to write a Parser/Writer for a new file format¶
To support your own file format in Poio API, you need to implement your own
parser as a sub-class of the base class poioapi.io.graf.BaseParser. The base
class contains six abstract methods that allow the GrAF converter to build a
GrAF object from the content of your files. The six methods are:
- get_root_tiers() - Get the root tiers.
- get_child_tiers_for_tier(tier) - Get the child tiers of a given tier.
- get_annotations_for_tier(tier, annotation_parent) - Get the annotations on a given tier.
- tier_has_regions(tier) - Check if the annotations on a given tier specify regions.
- region_for_annotation(annotation) - Get the region for a given annotation.
- get_primary_data() - Get the primary data that the annotations refer to.
Note: All the methods must be implemented, otherwise an exception will be raised.
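The effect of this note can be illustrated with Python's standard abc module. The sketch below uses a stand-in class SketchBaseParser (a hypothetical name, not the Poio API class) and assumes that the real base class enforces its abstract methods in a comparable way. A sub-class that leaves out one of the six methods cannot be instantiated:

```python
import abc


class SketchBaseParser(abc.ABC):
    """Hypothetical stand-in for poioapi.io.graf.BaseParser."""

    @abc.abstractmethod
    def get_root_tiers(self):
        """Return the root tiers."""

    @abc.abstractmethod
    def get_child_tiers_for_tier(self, tier):
        """Return the child tiers of a given tier."""

    @abc.abstractmethod
    def get_annotations_for_tier(self, tier, annotation_parent=None):
        """Return the annotations on a given tier."""

    @abc.abstractmethod
    def tier_has_regions(self, tier):
        """Tell whether annotations on the tier specify regions."""

    @abc.abstractmethod
    def region_for_annotation(self, annotation):
        """Return the region for a given annotation."""

    @abc.abstractmethod
    def get_primary_data(self):
        """Return the primary data the annotations refer to."""


class IncompleteParser(SketchBaseParser):
    # get_primary_data() is deliberately left out.
    def get_root_tiers(self):
        return []

    def get_child_tiers_for_tier(self, tier):
        return None

    def get_annotations_for_tier(self, tier, annotation_parent=None):
        return []

    def tier_has_regions(self, tier):
        return False

    def region_for_annotation(self, annotation):
        return None


try:
    IncompleteParser()
    error = None
except TypeError as exc:
    # abc refuses to instantiate a class with unimplemented abstract methods.
    error = exc

print(error)
```

The exact point at which the real base class raises its exception may differ from this sketch; the practical consequence is the same: implement all six methods, even if some only return an empty result.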
The tiers and annotations that are passed to the methods are normally objects
of the classes poioapi.io.graf.Tier and poioapi.io.graf.Annotation. If you need
to pass additional information between the methods that is not present in our
implementation of the classes, you can sub-class Tier and/or Annotation and add
your own properties. By sub-classing, you make sure that the properties from
our implementation are still there. The converter needs them to build the GrAF
object.
Each Tier contains a name and an annotation_space property (the latter is None
by default). The class ElanTier exemplifies the sub-classing of Tier. In the
case of Elan, we need to store an additional property linguistic_type to be
able to implement the complete parser:
class ElanTier(poioapi.io.graf.Tier):
    __slots__ = ["linguistic_type"]

    def __init__(self, name, linguistic_type):
        self.name = name
        self.linguistic_type = linguistic_type
        self.annotation_space = linguistic_type
Tiers use the annotation_space to indicate that they share certain annotation
types. If the annotation_space is None, the GrAF converter will use the name as
the label for the annotation space.
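The fallback rule can be sketched in a few lines. SketchTier and annotation_space_label() are hypothetical names used only for illustration; they mirror the rule described above and are not the converter's actual code:

```python
class SketchTier:
    """Hypothetical stand-in for poioapi.io.graf.Tier."""

    def __init__(self, name, annotation_space=None):
        self.name = name
        self.annotation_space = annotation_space


def annotation_space_label(tier):
    # Assumed rule from the text: fall back to the tier name
    # when no annotation space is set.
    if tier.annotation_space is None:
        return tier.name
    return tier.annotation_space


tiers = [
    SketchTier("words_speaker_a", "words"),
    SketchTier("words_speaker_b", "words"),
    SketchTier("utterance"),
]

labels = [annotation_space_label(t) for t in tiers]
print(labels)
```

The two speaker tiers end up in the same annotation space because they share the value "words", while the utterance tier is labelled by its own name.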
Each Annotation is defined with a unique id property and can contain a value
and a features property. Features are stored in a dictionary in the
feature_structure of the annotation in the GrAF representation.
References:
poioapi.io.graf.BaseParser
poioapi.io.graf.Tier
poioapi.io.graf.Annotation
Example: A simple parser based on static data¶
The transformation of annotation data to GrAF is done by the class
poioapi.io.graf.GrAFConverter. This class uses the parser's methods to retrieve
the information from the file.
Sub-classing from BaseParser¶
First, we will sub-class our own parser SimpleParser from the class
poioapi.io.graf.BaseParser with empty methods. We will set some static data
within the class that represents our tier names and the annotations for each
tier:
import poioapi.io.graf


class SimpleParser(poioapi.io.graf.BaseParser):

    # Static example data: tier names and the annotations of each tier.
    tiers = ["utterance", "word", "wfw", "graid"]

    utterance_tier = ["This is a utterance", "that is another utterance"]
    word_tier = [['This', 'is', 'a', 'utterance'],
                 ['that', 'is', 'another', 'utterance']]
    wfw_tier = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
    graid_tier = ['i', 'j', 'k', 'l', 'm', 'n', 'o', 'p']

    def __init__(self):
        pass

    def get_root_tiers(self):
        pass

    def get_child_tiers_for_tier(self, tier):
        pass

    def get_annotations_for_tier(self, tier, annotation_parent=None):
        pass

    def tier_has_regions(self, tier):
        pass

    def region_for_annotation(self, annotation):
        pass

    def get_primary_data(self):
        pass
If your annotations are stored in a file, you need to implement your own
strategy for loading the file's content into your parser class. The __init__()
method of your parser class is a good place to load the file.
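As a minimal sketch of this idea, the class below loads a file in its constructor. The class name FileBackedParser and the tab-separated file layout (one annotation per line, tier name followed by the value) are assumptions made for illustration only; a real parser would sub-class BaseParser and read whatever format you want to support:

```python
import os
import tempfile


class FileBackedParser:
    """Hypothetical sketch; a real parser would sub-class BaseParser."""

    def __init__(self, filepath):
        # Assumed format: one annotation per line, "tier<TAB>value".
        self.annotations = {}
        with open(filepath, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                tier_name, value = line.split("\t", 1)
                self.annotations.setdefault(tier_name, []).append(value)


# Write a small example file and load it through the constructor.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.tsv")
    with open(path, "w", encoding="utf-8") as f:
        f.write("utterance\tThis is a utterance\n")
        f.write("utterance\tthat is another utterance\n")
    parser = FileBackedParser(path)

print(parser.annotations["utterance"])
```

Loading everything once in the constructor keeps the six parser methods free of file handling; they only need to look up data that is already in memory.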
References:
poioapi.io.graf.GrAFConverter
Implementation of the parser methods¶
We will start with the get_root_tiers() method. This method returns all the
root tiers as objects of the class Tier (or a sub-class of it). In our case,
this is only the utterance tier:
    def get_root_tiers(self):
        return [poioapi.io.graf.Tier("utterance")]
The method get_child_tiers_for_tier() returns all child tiers of a given tier,
again as Tier objects. In our simple example, we assume that the child of the
utterance tier is the word tier, which has the children graid and wfw:
    def get_child_tiers_for_tier(self, tier):
        if tier.name == "utterance":
            return [poioapi.io.graf.Tier("word")]
        if tier.name == "word":
            return [poioapi.io.graf.Tier("graid"), poioapi.io.graf.Tier("wfw")]

        return None
Note: These two methods must always return a list of Tier objects or None.
The method get_annotations_for_tier() is used to collect the annotations for a
given tier. Each annotation must contain at least a unique id and an annotation
value. Both properties are already present in the class Annotation that we use
here to return the annotations. For the utterance tier we can simply convert
the list of strings in our self.utterance_tier data store:
    def get_annotations_for_tier(self, tier, annotation_parent=None):
        if tier.name == "utterance":
            return [poioapi.io.graf.Annotation(i, v)
                    for i, v in enumerate(self.utterance_tier)]

        [...]
For all tiers that are children of another tier, the annotations on the tier
are normally also children of an annotation on the parent tier. In this case
the Converter will pass the parent annotation in the parameter
annotation_parent. In our case, the id of the parent annotation points to the
location of the child annotations in the lists self.word_tier, self.graid_tier
and self.wfw_tier:
        [...]

        # Word ids start at 2, after the utterance ids 0 and 1.
        if tier.name == "word":
            return [poioapi.io.graf.Annotation(2 + 4 * annotation_parent.id + i, v)
                    for i, v in enumerate(self.word_tier[annotation_parent.id])]

        # The parent is a word (ids 2 to 9), so id - 2 is the list index.
        if tier.name == "graid":
            return [poioapi.io.graf.Annotation(
                annotation_parent.id + 10, self.graid_tier[annotation_parent.id - 2])]

        if tier.name == "wfw":
            return [poioapi.io.graf.Annotation(
                annotation_parent.id + 12, self.wfw_tier[annotation_parent.id - 2])]

        return []
Note: This method must always return a list with Annotation elements or an
empty list.
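To see how the three methods introduced so far work together, the following self-contained sketch walks the tier hierarchy depth-first and passes each annotation on as annotation_parent to the child tiers. It is an assumption about how a converter might call the methods, not the actual GrAFConverter code. Tier and Annotation are simple stand-ins for the Poio API classes, the walk() helper is hypothetical, and only the utterance and word tiers are included to keep the example short:

```python
from collections import namedtuple

# Hypothetical stand-ins for poioapi.io.graf.Tier and Annotation.
Tier = namedtuple("Tier", ["name"])
Annotation = namedtuple("Annotation", ["id", "value"])

UTTERANCES = ["This is a utterance", "that is another utterance"]
WORDS = [u.split() for u in UTTERANCES]


def get_root_tiers():
    return [Tier("utterance")]


def get_child_tiers_for_tier(tier):
    if tier.name == "utterance":
        return [Tier("word")]
    return None


def get_annotations_for_tier(tier, annotation_parent=None):
    if tier.name == "utterance":
        return [Annotation(i, v) for i, v in enumerate(UTTERANCES)]
    if tier.name == "word":
        # Same id scheme as in the text: word ids start at 2.
        return [Annotation(2 + 4 * annotation_parent.id + i, v)
                for i, v in enumerate(WORDS[annotation_parent.id])]
    return []


def walk(tier, annotation_parent=None, depth=0, out=None):
    """Depth-first walk; collects (depth, tier name, id, value) tuples."""
    out = [] if out is None else out
    for annotation in get_annotations_for_tier(tier, annotation_parent):
        out.append((depth, tier.name, annotation.id, annotation.value))
        # A return value of None means "no child tiers".
        for child in get_child_tiers_for_tier(tier) or []:
            walk(child, annotation, depth + 1, out)
    return out


rows = [row for root in get_root_tiers() for row in walk(root)]
for depth, tier_name, ann_id, value in rows:
    print("  " * depth + f"{tier_name} {ann_id}: {value}")
```

The walk yields the first utterance (id 0) followed by its words (ids 2 to 5), then the second utterance (id 1) with its words (ids 6 to 9), which shows why the word ids are derived from the id of the parent annotation.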
The method tier_has_regions() describes which tiers contain regions. These
regions are intervals that refer to the primary data. Depending on the type of
the primary data, the regions can encode intervals of time (encoded as
milliseconds, in most cases) or a range in a string (from start to end
position). In our case we assume that only the root tier utterance is connected
to the primary data via regions:
    def tier_has_regions(self, tier):
        if tier.name == "utterance":
            return True

        return False
To get the region of a specific annotation, the Converter will call the method
region_for_annotation(). This method must return a tuple with the start and end
of the region. In our example the tier with regions is the utterance tier. The
region for the first utterance is therefore (0, 19), if we assume that we want
to return the content of the two utterances joined by a blank " " as the
primary data. We can simply calculate the regions from the length of the
strings in self.utterance_tier:
    def region_for_annotation(self, annotation):
        if annotation.id == 0:
            return (0, len(self.utterance_tier[0]))
        elif annotation.id == 1:
            return (len(self.utterance_tier[0]) + 1,
                    len(self.utterance_tier[0]) + 1 + len(self.utterance_tier[1]))
Last but not least, we also have to return the primary data. As the utterance
tier is the root tier and we already defined the regions for the utterance
annotations based on the strings in self.utterance_tier, we can simply join the
two strings and return the result as the primary data:
    def get_primary_data(self):
        return ' '.join(self.utterance_tier)
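The region arithmetic can be checked independently of Poio API. The sketch below recomputes the regions for the two example utterances and verifies that each region, read as a Python slice of the primary data, yields the original utterance. The helper regions_for() is a hypothetical name, and treating the end position as exclusive (as in Python slicing) is an assumption of this sketch:

```python
utterance_tier = ["This is a utterance", "that is another utterance"]

# Primary data as described in the text: utterances joined by a blank.
primary_data = " ".join(utterance_tier)


def regions_for(utterances, separator=" "):
    """Return one (start, end) tuple per utterance in the joined string."""
    regions = []
    start = 0
    for text in utterances:
        end = start + len(text)
        regions.append((start, end))
        # Skip the separator before the next utterance.
        start = end + len(separator)
    return regions


regions = regions_for(utterance_tier)
print(regions)  # [(0, 19), (20, 45)]

# Each region must point back at its utterance in the primary data.
for (start, end), text in zip(regions, utterance_tier):
    assert primary_data[start:end] == text
```

A general helper like this scales to any number of utterances, whereas the two-branch version in the example parser only covers the ids 0 and 1.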
Using the parser to convert to GrAF¶
You can now use the SimpleParser class to convert the static data into a GrAF
object:
parser = SimpleParser()
converter = poioapi.io.graf.GrAFConverter(parser)
converter.parse()
graf = converter.graf
The converter object contains two more objects that contain information from the parsed data:
- The tier hierarchies are stored in converter.tier_hierarchies.
- The primary data for the annotations is stored in converter.primary_data.
If you want to write the data to GrAF files, you have to create a GrAF writer object and pass it to the Converter’s constructor:
parser = SimpleParser()
writer = poioapi.io.graf.Writer()
converter = poioapi.io.graf.GrAFConverter(parser, writer)
converter.parse()
converter.write("simple.hdr")
The section Spreadsheet to GrAF conversion discusses a slightly more complex use case: how to write a parser for custom annotations stored in a Microsoft Excel file.