Getting Started Guide

Here are some quick steps to get you started with OpenFormats.

Installation

To use OpenFormats as a Python library, simply install it with pip, prefixing with sudo if permissions warrant:

pip install openformats

If you plan to tweak the codebase or add your own format handler, grab a copy of the whole repository from GitHub:

git clone https://github.com/transifex/openformats.git
cd openformats

Creating your own handler

OpenFormats supports a variety of file formats, including plaintext (.txt), subtitles (.srt) and others. Here are the steps to create your own handler.

1. Subclass the base Handler

class openformats.handlers.Handler

This class defines the interface you need to implement in order to create a handler. Both the parse and compile methods must be implemented.

parse(content)

Parses the content, extracts translatable strings into a stringset, replaces them with hashes and returns a tuple of the template and the stringset.

Typically this is done in the following way:

  • Use a library or your own code to segment (deserialize) the content into translatable entities.
  • Choose a key to uniquely identify the entity.
  • Create an OpenString object representing the entity.
  • Create a hash to replace the original content with.
  • Create a stringset with the content.
  • Use a library or your own code to serialize the stringset back into a template.

compile(template, stringset)

Parses the template, finds the hashes, replaces them with strings from the stringset and returns the compiled file. If a hash in the template isn’t found in the stringset, it’s good practice to remove the whole string section surrounding it.

Typically this is done in the following way:

  • Use a library or your own code to segment (deserialize) the template into translatable entities, treating the hashes as if they were the translatable entities.
  • Make sure the hash matches the first string in the stringset.
  • Replace the hash with the string.
  • Use a library or your own code to serialize the entities back into a compiled file.

You can safely assume that the stringset will have strings in the correct order for the above process and thus you will probably be able to perform the whole compilation in a single pass.
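As a rough illustration of the two methods, here is a minimal, self-contained sketch for a line-based plaintext format. It does not import OpenFormats: ToyString and ToyHandler are hypothetical stand-ins for OpenString and a Handler subclass, and the hash scheme (an MD5 digest of the key plus a "_tr" suffix) is an assumption made for this example only.

```python
import hashlib


class ToyString(object):
    """Hypothetical stand-in for OpenString: a key, a string and a hash."""

    def __init__(self, key, string, order=0):
        self.key = key
        self.string = string
        self.order = order
        # Assumed hash scheme, for illustration only.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        self.template_replacement = digest + "_tr"


class ToyHandler(object):
    """Treats every non-empty line as one translatable entity."""

    def parse(self, content):
        stringset, template_lines = [], []
        for number, line in enumerate(content.split("\n")):
            if not line.strip():
                template_lines.append(line)  # keep blank lines as-is
                continue
            # The line index serves as the unique key.
            string = ToyString(str(number), line, order=len(stringset))
            stringset.append(string)
            template_lines.append(string.template_replacement)
        return "\n".join(template_lines), stringset

    def compile(self, template, stringset):
        remaining = list(stringset)  # assumed to be in template order
        compiled_lines = []
        for line in template.split("\n"):
            if remaining and line == remaining[0].template_replacement:
                compiled_lines.append(remaining.pop(0).string)
            elif line.endswith("_tr"):
                continue  # hash without a string: drop the whole line
            else:
                compiled_lines.append(line)
        return "\n".join(compiled_lines)


handler = ToyHandler()
template, stringset = handler.parse("Hello\n\nGoodbye")
# Pretend the translation is simply the upper-cased source string.
translated = [ToyString(s.key, s.string.upper(), s.order) for s in stringset]
compiled = handler.compile(template, translated)
```

Because the stringset arrives in the same order as the hashes appear in the template, compile only ever compares against the first remaining string, which is what makes the single pass possible.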

The following are some classes that will help you with this process:

2. The OpenString class

class openformats.strings.OpenString(key, string_or_strings, **kwargs)

This class will abstract away the business of generating hashes out of your strings and will serve as a place to get translations from when compiling. Several OpenStrings in our process define a Stringset, which is simply a Python list of OpenStrings. To create an OpenString, you need two arguments:

  • The ‘key’

    Something in your source file that uniquely identifies the section that the source string originated from. It might be helpful for your compiler to use something that appears in the same form in language files as well.

  • The ‘string’ or ‘plural forms of the string’:

    If the file format you’re working with does not support plural forms, or if the string in question is not pluralized, you can just supply the string itself as the second argument. If your string is pluralized, however, you have to supply all plural forms in a dictionary with the rule numbers as keys. For example:

    OpenString("UNREAD MESSAGES",
               {1: "You have %s unread message",
                5: "You have %s unread messages"})
    
  • There are a number of optional keyword arguments to OpenString:

    context, order, character_limit, occurrences, developer_comment, flags, fuzzy, obsolete

Their main purpose is to provide context to the translators so that they can achieve higher quality. Two of them however, though optional, are highly recommended:

  • Context

    This is also taken into account when producing the hash, so if you can’t ensure that your keys are unique within the source file, you can still get away with ensuring that the (key, context) pair is.

  • Order

    If you provide an order (integer), Transifex will save it in the database and then, when you try to compile a template against a stringset fetched from Transifex, it will already be ordered, even if it contains translations. This can allow you to optimize the compilation process as the order that the hashes appear in the template will be the same as the order of strings in the stringset.

    Another valuable outcome is that the order will be preserved when the strings are shown to translators which can provide context and thus improve translation quality.

Once you have created an OpenString, you can get its hash using the template_replacement property.
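To see why the (key, context) pair matters, consider the sketch below. It is not OpenFormats’ actual hashing code: toy_hash is a hypothetical helper that assumes the hash is an MD5 digest of the key and context joined by a colon, followed by a "_tr" suffix.

```python
import hashlib


def toy_hash(key, context=""):
    """Hypothetical hash: MD5 of 'key:context' (assumed scheme)."""
    raw = u":".join([key, context]).encode("utf-8")
    return hashlib.md5(raw).hexdigest() + "_tr"


# The same key used in two places collides when no context is given...
same = toy_hash("title") == toy_hash("title")

# ...but stays distinct when each occurrence carries its own context.
distinct = toy_hash("title", "homepage") != toy_hash("title", "settings")
```

Under this assumption, two entities that share a key would be replaced by the same hash in the template unless their contexts differ.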

3. The Transcriber

class openformats.transcribers.Transcriber(source)

This class helps with creating a template from an imported file or compiling an output file from a template.

Main functionality

This class will help with both creating a template from an imported file and compiling a file from a template. It provides functions for copying text. It depends on three things: the source content (self.source); the target content (self.destination), which initially is empty; and a pointer (self.ptr), which is initialized to 0 and indicates how much of ‘source’ has already been copied to ‘destination’.

Transcriber detects and remembers the newline type (DOS '\r\n' or UNIX '\n') of ‘source’. It then converts ‘source’ to UNIX-like newlines and works on this. When returning the destination, the initial newline type will be used. Because ‘source’ may be modified in this way, it’s a good idea to save Transcriber’s source back on top of the original one:

>>> def parse(self, source):
...     self.transcriber = Transcriber(source)
...     self.source = self.transcriber.source
...     # ...

The main methods provided are demonstrated below:

>>> transcriber = Transcriber(source)

source:      <string name="foo">hello world</string>
ptr:         ^ (0)
destination: []

>>> transcriber.copy_until(source.index('>') + 1)

source:      <string name="foo">hello world</string>
ptr:                            ^
destination: ['<string name="foo">']

>>> transcriber.add("aee8cc2abd5abd5a87cd784be_tr")

source:      <string name="foo">hello world</string>
ptr:                            ^
destination: ['<string name="foo">', 'aee8cc2abd5abd5a87cd784be_tr']

>>> transcriber.skip(len("hello world"))

source:      <string name="foo">hello world</string>
ptr:                                       ^
destination: ['<string name="foo">', 'aee8cc2abd5abd5a87cd784be_tr']

>>> transcriber.copy_until(source.index("</string>") +
...                        len("</string>"))

source:      <string name="foo">hello world</string>
ptr:                                                ^
destination: ['<string name="foo">', 'aee8cc2abd5abd5a87cd784be_tr',
              '</string>']

>>> print(transcriber.get_destination())

<string name="foo">aee8cc2abd5abd5a87cd784be_tr</string>
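The pointer bookkeeping above can be modeled in a few lines. The sketch below is a simplified stand-in written for this guide, not the library’s implementation: MiniTranscriber is a hypothetical class that mirrors only the copy_until, add, skip and get_destination methods shown above and ignores newline handling.

```python
class MiniTranscriber(object):
    """Simplified model of the copy/skip pointer logic (illustration only)."""

    def __init__(self, source):
        self.source = source
        self.destination = []  # list of chunks, joined at the end
        self.ptr = 0           # index of the first uncopied character

    def copy_until(self, end):
        # Copy source[ptr:end] to the destination and advance the pointer.
        self.destination.append(self.source[self.ptr:end])
        self.ptr = end

    def add(self, text):
        # Append new text without touching the pointer.
        self.destination.append(text)

    def skip(self, offset):
        # Advance the pointer without copying anything.
        self.ptr += offset

    def get_destination(self):
        return "".join(self.destination)


source = '<string name="foo">hello world</string>'
transcriber = MiniTranscriber(source)
transcriber.copy_until(source.index(">") + 1)   # copy the opening tag
transcriber.add("aee8cc2abd5abd5a87cd784be_tr")  # insert the hash
transcriber.skip(len("hello world"))             # drop the original string
transcriber.copy_until(len(source))              # copy the closing tag
result = transcriber.get_destination()
```

Keeping the destination as a list of chunks rather than a growing string is what makes it cheap to append text and, later, to blank out whole sections.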

remove_section(place=0)

You can mark sections in the target file and optionally remove them. Insert the section-start and section-end bookmarks wherever you want to mark a section. Then you can remove a section with remove_section(). For example:

>>> transcriber = Transcriber(source)

source:      <keep><remove>
ptr:         ^ (0)
destination: []

>>> start = 0

>>> transcriber.mark_section_start()
>>> transcriber.copy_until(start + 1)  # copy until first '<'
>>> string = source[start + 1:source.index('>', start)]
>>> transcriber.add("asdf")  # add the hash
>>> transcriber.skip(len(string))
>>> transcriber.copy_until(source.index('>', start) + 1)
>>> transcriber.mark_section_end()

source:      <keep><remove>
ptr:               ^
destination: [SectionStart, '<', 'asdf', '>', SectionEnd]

>>> if string == "remove":
...     transcriber.remove_section()

(nothing happens)

>>> start = source.index('>') + 1

>>> # Same deal as before, mostly
>>> transcriber.mark_section_start()
>>> transcriber.copy_until(start + 1)  # copy until second '<'
>>> string = source[start + 1:source.index('>', start)]
>>> transcriber.add("fdsa")  # add the hash
>>> transcriber.skip(len(string))
>>> transcriber.copy_until(source.index('>', start) + 1)
>>> transcriber.mark_section_end()

source:      <keep><remove>
ptr:                       ^
destination: [SectionStart, '<', 'asdf', '>', SectionEnd,
              SectionStart, '<', 'fdsa', '>', SectionEnd]

>>> if string == "remove":
...     transcriber.remove_section()

source:      <keep><remove>
ptr:                       ^
destination: [SectionStart,  '<', 'asdf', '>',  SectionEnd,
              None        , None, None  , None, None      ]

(The last section was replaced with Nones)

Now, when you try to get the result with get_destination(), the Nones, SectionStarts and SectionEnds will be omitted:

>>> transcriber.get_destination()

<asdf>
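One way to picture the bookmarks is as sentinel objects stored in the destination list. The sketch below assumes that model; SECTION_START, SECTION_END and the two helper functions are hypothetical names, and it only handles removing the most recent section.

```python
SECTION_START, SECTION_END = object(), object()  # hypothetical sentinels


def remove_last_section(destination):
    """Blank out everything from the last section start to the last end."""
    # Find the last end marker, then the closest start marker before it.
    end = len(destination) - 1 - destination[::-1].index(SECTION_END)
    start = end
    while destination[start] is not SECTION_START:
        start -= 1
    for i in range(start, end + 1):
        destination[i] = None


def render(destination):
    """Join the chunks, omitting Nones and section markers."""
    skipped = (None, SECTION_START, SECTION_END)
    return "".join(chunk for chunk in destination
                   if not any(chunk is marker for marker in skipped))


destination = [SECTION_START, "<", "asdf", ">", SECTION_END,
               SECTION_START, "<", "fdsa", ">", SECTION_END]
remove_last_section(destination)
output = render(destination)
```

Replacing the removed chunks with None instead of deleting them keeps every other index in the list stable.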

line_number

The transcriber remembers how many newlines it has gone over in the source, both when copying and when skipping content. This allows you to pinpoint the line number on which a parse error occurred. For example:

source:
    first line
    second line
    third line with error
    fourth line

>>> transcriber = Transcriber(source)
>>> for line in source.split("\n"):
...     if "error" not in line:
...         # include the newline too
...         transcriber.copy(len(line) + 1)
...     else:
...         raise ParseError(
...             "Error on line {line_no}: '{line}'".format(
...                 line_no=transcriber.line_number,
...                 line=line
...             )
...         )

This will raise:

ParseError("Error on line 3: 'third line with error'")
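The same line number can be derived by hand from a pointer into the source, which is roughly the bookkeeping the transcriber saves you from. This is a standalone sketch using only built-in string methods; the ptr and error_line variables are illustrative names, not part of the library.

```python
source = "first line\nsecond line\nthird line with error\nfourth line"

ptr = 0            # how much of the source has been consumed so far
error_line = None
for line in source.split("\n"):
    if "error" in line:
        # Lines are 1-based: count the newlines passed so far and add one.
        error_line = source.count("\n", 0, ptr) + 1
        break
    ptr += len(line) + 1  # move past the line and its newline
```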

edit_newlines(chunk, enforce_newline_type=None)

This is the part that renders the newlines to their correct type when returning the final result. You have the option to enforce the newline type if you want to.

>>> source = "hello\r\nworld"
>>> t = Transcriber(source)
>>> t.source
"hello\nworld"
>>> source = t.source
>>> # Work as if source was UNIX-type
>>> t.copy_until(source.index('\n') + 1)  # include the '\n'
>>> t.add("fellas")
>>> t.get_destination()
"hello\r\nfellas"  # <- it remembered newline type from source
>>> t.get_destination(enforce_newline_type="UNIX")
"hello\nfellas"
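The detect, normalize and restore cycle described above might look like the following. This is a sketch under assumptions, not the library’s code: detect_newline_type and restore_newlines are hypothetical helpers, and only the DOS and UNIX types are considered.

```python
def detect_newline_type(source):
    """Return 'DOS' if the text uses '\\r\\n', otherwise 'UNIX'."""
    return "DOS" if "\r\n" in source else "UNIX"


def restore_newlines(chunk, newline_type):
    """Render UNIX-style text with the requested newline type."""
    if newline_type == "DOS":
        return chunk.replace("\n", "\r\n")
    return chunk


original = "hello\r\nworld"
newline_type = detect_newline_type(original)   # remembered for later
working = original.replace("\r\n", "\n")       # work on UNIX newlines
edited = working[:working.index("\n") + 1] + "fellas"

as_source = restore_newlines(edited, newline_type)  # original newline type
as_unix = restore_newlines(edited, "UNIX")          # enforced newline type
```

Normalizing first means every index computed while parsing refers to a single newline convention, regardless of where the file was authored.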

Continue reading the other documentation sections for further details.