Importing Toolbox dictionaries into FLEx

Note: Some consultants had posted their own sets of cleanup/import steps on a separate mailing list. These have been pasted in and can currently be viewed separately, though it would be good to eventually incorporate parts of them into this list:

1. Run a utility like “joinlines” on the data file, so that every field is on a single line (no wrapped fields).

If you don't have Joinlines, you can do this in Word or with the Alt. Find & Replace tool of LibreOffice or OpenOffice. Replace all newlines with a space, then replace every space that precedes a backslash with a newline.

UPDATE: If you're using Toolbox and it's been doing the wrapping, then as of version 1.5.5, it's probably better to let Toolbox do the unwrapping too: With Toolbox closed, open the lexicon in a text editor, and add a line containing \SaveWithoutNewlines somewhere between the first and last lines.

  • It is probably also wise at this point to remove all trailing spaces in the file.
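
If none of these tools is at hand, the same unwrapping can be sketched in a few lines of Python. This is only a minimal sketch, not the Joinlines utility itself: it assumes that every line not starting with a backslash continues the previous field, and the function name join_wrapped_fields is made up for this example.

```python
def join_wrapped_fields(text):
    """Put each SFM field on a single line and strip trailing spaces.

    Assumption: any non-blank line that does not start with a backslash
    is a wrapped continuation of the field above it.
    """
    joined = []
    for raw in text.splitlines():
        line = raw.rstrip()                  # drop trailing spaces
        if line.startswith("\\") or not joined:
            joined.append(line)              # a new field (or the first line)
        elif line:
            # continuation: glue it onto the previous line with one space
            joined[-1] = (joined[-1] + " " + line.strip()).strip()
        else:
            joined.append("")                # keep blank lines between records
    return "\n".join(joined) + "\n"
```

Run it on a copy of the data file, and compare the result with the original before going further.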

2. Is the data in Unicode?

  • Do you have encoding conversion maps for any fields that are in legacy? (These can be applied during import, but you need to have them on hand.)
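
A quick way to find out which fields are still in a legacy encoding is to look for lines that are not valid UTF-8. The following is a rough sketch (the function name non_utf8_lines is hypothetical). Note that it can only prove that a line is not Unicode: plain ASCII, and some legacy data, will decode without error.

```python
def non_utf8_lines(data):
    """Return (line_number, marker) for each line that is not valid UTF-8.

    `data` is the raw file content as bytes.
    """
    problems = []
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            # The marker itself is normally ASCII, so it is safe to show.
            marker = raw.split(b" ", 1)[0].decode("ascii", errors="replace")
            problems.append((number, marker))
    return problems
```

Read the file in binary mode, e.g. non_utf8_lines(open("datafile.sfm", "rb").read()). The markers that show up in the result are the ones likely to need a conversion map.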

3. Are there inline character formatting codes in this data?

  • What are they?
  • What do they mean (e.g., writing system or style or ??)?
  • What fields do they occur in?
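
To answer the last question, it can help to tabulate which inline codes occur in which fields. The sketch below assumes the codes look like |fv{...} (a bar, a short code, then an opening brace). That pattern is an assumption and will need adjusting if this data marks inline formatting some other way. The names INLINE_CODE and inline_codes_by_field are made up for the example.

```python
import re
from collections import Counter, defaultdict

# Assumed pattern: a bar, a short code, then an opening brace, e.g. |fv{...}.
INLINE_CODE = re.compile(r"\|([A-Za-z]+)\{")

def inline_codes_by_field(text):
    """Map each inline code to the markers it occurs in, with counts."""
    usage = defaultdict(Counter)
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue
        marker = line.split(" ", 1)[0]
        for code in INLINE_CODE.findall(line):
            usage[code][marker] += 1
    return {code: dict(fields) for code, fields in usage.items()}
```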

4. Analyze the SFMs

  • If you have access to Linux, you can create a list of all the SFMs used in the file by doing something like (where “datafile.sfm” is the name of your data file):
sed -e 's/ .*$//' datafile.sfm | sort | uniq -c > sfm.lst

The list of all SFMs used will now be in the sfm.lst file.

  • In Toolbox, you can see the SFMs used in the file by going to Database > Properties. Any marker in bold is used in the file. A marker not in bold is in the database type file but not in the data file.
  • Look for typos
    • For markers that occur only once, is the marker spelled wrong?
    • Are there places where the space between the marker and the contents was omitted?
    • Are there any other anomalies you see by looking over the list of markers?
    • Create a copy of the database with a new name (e.g., datafile-mod.sfm) in which you start making changes according to what you find.
    • Only make changes you are confident of. For others, make a note of questions to ask the linguist. Record all the changes you make so later you can show the linguist what you did.
  • Make a chart of the SFMs. For each one, indicate:
    • What it is used for
    • What writing system it is in/should go into in FLEx
    • Target field in FLEx
    • Whether it needs a custom field
    • If it is a custom field, what writing system, and what level in the hierarchy (entry, sense, example, allomorph)
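
If Linux is not available, the marker inventory can also be produced with a short Python script. This is a sketch only, with made-up function names. It counts every marker and lists the ones that occur exactly once, which is where misspelled markers and missing spaces usually show up.

```python
from collections import Counter

def marker_counts(text):
    """Count how often each SFM marker occurs in the data."""
    counts = Counter()
    for line in text.splitlines():
        if line.startswith("\\"):
            # the marker is everything up to the first whitespace
            counts[line.split(None, 1)[0]] += 1
    return counts

def singletons(counts):
    """Markers that occur only once: candidates for typos."""
    return sorted(marker for marker, n in counts.items() if n == 1)
```

For example, a line such as \gethree (where the space after \ge was omitted) will appear as its own marker with a count of 1. The counts can also serve as the first column of the SFM chart.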

5. Unknown fields: fields where it is not clear what the contents are or what the function is

  • Look in the .typ file to see if there is an indication of what it is.
  • Look at the contents to see if you can guess.
    • You can search for that field in a text editor and look at several instances. Sometimes that is enough.
    • You can also do something like grep '^\\lf ' datafile.sfm | sort | uniq -c > lf.lst to see the complete list of contents for that field (here, \lf). This is especially useful for fields with fixed contents, such as the source of the data or a confidence level or a lexical relation.
    • Ask the linguist.
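
The same listing of a field's contents can be sketched in Python, for those without grep. The function name field_values is hypothetical; pass the marker without its backslash.

```python
from collections import Counter

def field_values(text, marker):
    """Count each distinct content string found in one field."""
    prefix = "\\" + marker
    values = Counter()
    for line in text.splitlines():
        head, _, rest = line.partition(" ")
        if head == prefix:           # exact marker match, so \lf is not \lfx
            values[rest] += 1
    return values
```

Calling field_values(text, "lf").most_common() lists the contents from most to least frequent, which makes fixed-content fields easy to recognize.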

6. Assess fixed-content fields

  • Certain fields tend to have “list” content in them. For these fields, you have three choices:
    • Bring it directly into the corresponding field in FLEx. (Only do this if you have first determined that the content is “clean” and you have prepared that list in FLEx to receive this data.)
    • Clean up the data, and then bring it directly into FLEx.
    • Bring it into a custom field and do the cleanup after it is in FLEx, or teach the linguist how to do the cleanup themselves.
  • The fields it is important to check in this way include: \ps, \lf, \sd. For the data you are working with, you may discover others that also need this evaluation.
  • The things to check are:
    • What are the contents of this field (if it hasn't been clear already)?
    • Is the data in this field consistent, or have the same things been spelled differently (e.g., adj, Adj, adj., Adj., adj .)? Also watch for trailing spaces.
    • What writing system is it in? Is it consistent? Are there embedded writing systems?
    • Are the contents a full name, or an abbreviation? (e.g., “adjective” vs. “adj”) Is it consistent in this regard? (e.g., some fields contain abbreviations and others contain names.)
    • Decide whether to clean up inconsistencies you have discovered in this way. If you do, do it in the copy of the file you might have made above, e.g., datafile-mod.sfm. Be sure you can report all the changes you made.
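
Spelling variants of this kind can be found mechanically. The sketch below groups the values of one field by a normalized form (lower case, with spaces and periods removed) and reports only the groups that contain more than one spelling. The normalization rule and the function name inconsistent_values are assumptions for illustration. Run it before stripping trailing spaces if you want those reported too.

```python
import re
from collections import Counter, defaultdict

def inconsistent_values(text, marker):
    """Group a field's values by normalized form; keep groups with variants."""
    prefix = "\\" + marker + " "
    groups = defaultdict(Counter)
    for line in text.splitlines():
        if line.startswith(prefix):
            value = line[len(prefix):]
            # normalize: drop whitespace and periods, ignore case
            key = re.sub(r"[\s.]+", "", value).lower()
            groups[key][value] += 1
    return {key: dict(found) for key, found in groups.items() if len(found) > 1}
```

It will not catch a full name against its abbreviation (“adjective” vs. “adj”); those still need to be spotted in the full list of values.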

7. Evaluate any complex entries (variants or subentries) and cross-references

  • How are they indicated?
  • Is it structured the way FLEx expects?
  • Are the minor entries also explicitly in the database? Will there be duplicates created on import?
  • Does the linguist want everything that is referenced to be created as a new entry, if it wasn't already in the lexicon?
  • For cross-references and lexical relations (e.g., synonyms), should a new entry be created for a referenced word if it is not already in the lexicon? (FLEx will create it. If the linguist doesn't want that, you will need to do some modification to prevent that.)
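
Before asking the linguist, it helps to know how many referenced words are actually missing. Here is a rough sketch. It assumes the headword is in \lx and that each reference field holds a single target, and it uses \cf, \sy, \an and \mn as example reference markers. It ignores homograph numbers, subentries and variants, so treat the result as a list to review, not a definitive report. The function name missing_targets is made up.

```python
def missing_targets(text, ref_markers=("cf", "sy", "an", "mn")):
    """List (marker, target) pairs whose target is not an \\lx headword."""
    lexemes = set()
    references = []
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue
        head, _, value = line.partition(" ")
        marker, value = head[1:], value.strip()
        if marker == "lx":
            lexemes.add(value)
        elif marker in ref_markers and value:
            references.append((marker, value))
    # compare only after the whole file is read: targets may come later
    return sorted({(m, v) for m, v in references if v not in lexemes})
```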

8. Run SOLID on the database

  • Determine whether this is the default MDF hierarchy (\ps is above \sn) or the “alternate” hierarchy (\ps is inside of \sn).
  • If there is any legacy data in this database, choose one of the “legacy” options.
  • Determine the hierarchy of the fields, and set that up in SOLID.
  • For each marker, allow as few “parent” markers as possible.
  • Avoid using “infer [parent] marker” for as many fields as possible. It is commonly needed only for \sn, and possibly for examples, but in most cases you will get better results if you don't infer.
  • SOLID allows entry-level fields to occur between and after sense-level fields, but FLEx interprets these differently. Be very careful with this kind of thing. You may need to move some of these fields.
  • SOLID allows the same marker to be used at all levels of the hierarchy, but FLEx doesn't. So, for instance, if the linguist has used \co for a comment on entries, senses, and examples, you will need to change some of these markers. Use one marker for comments on an entry, another for comments on senses, and another for comments on examples. “Source” is another one that is commonly used at many levels.
  • When it is ambiguous whether one of those fields refers to the entry level or sense level, you may need to check with the linguist. If they are sitting with you as you are doing it, you can ask as you find them. Or you may need to make a list of things to ask them in email or when you can make an appointment with them.
  • The Quick Fixes in SOLID can be helpful for moving fields to a different position in the hierarchy, or for converting the implied markers into actual markers in the file. But be extremely careful with these, and only use them after everything else is coming out correctly. And always have a backup before you apply them.
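
To find markers that are used at more than one level, a simple script can note what kind of field most recently opened a level before each occurrence. This is a crude heuristic and not what SOLID does: it assumes \lx opens an entry, \sn a sense and \xv an example, and it only looks at the markers you ask about (\co and \so by default here, both just examples). All names in the sketch are hypothetical.

```python
from collections import defaultdict

# Assumed level-opening markers; adjust these to the data at hand.
LEVEL_OPENERS = {"lx": "entry", "sn": "sense", "xv": "example"}

def markers_by_level(text, watch=("co", "so")):
    """Report watched markers that follow more than one kind of level opener."""
    levels = defaultdict(set)
    current = None
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue
        marker = line.split(None, 1)[0][1:]
        if marker in LEVEL_OPENERS:
            current = LEVEL_OPENERS[marker]      # a new level begins
        elif marker in watch and current is not None:
            levels[marker].add(current)
    return {m: sorted(found) for m, found in levels.items() if len(found) > 1}
```

A marker reported under several levels is a candidate for splitting into separate markers. Whether a given occurrence really belongs to the entry or the sense may still be a question for the linguist.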

9. Prepare FLEx to receive the data.

  • Set up the writing systems that occur in this data.
  • Create the categories (part of speech) that you have found in the data. If you are importing directly into the category field, make sure the contents of the fields match what you create in FLEx. For instance, if the field contains names, make sure the names match. If the field contains abbreviations, edit the abbreviations for all the categories you create to be sure they match.
  • Create any custom fields that will be needed.
  • If necessary, create new values for Status or for Lexical Relations. Make sure the name or abbreviation matches what is in the data.
  • BACK UP the empty project AFTER you have created all these things!! Usually when I do an import, I have to try it over and over again, and the way I start over is by restoring from an empty database. It is really helpful if that empty database has all these things already done to it. It is quite difficult (and error-prone) to have to do them over and over.

10. Try the import, check the import preview, make any changes, and try it again.

  • There may be messages about invalid UTF-8 data. Double check the writing systems and the contents.
  • If the import preview reports entries that need checking, investigate what the problem is. It is usually a hierarchy problem, and it might be fixed in several ways:
    • Maybe the key markers need to be adjusted (in the import wizard).
    • Maybe the structure of the database still needs further adjusting.