Jon's steps for SFM cleanup and import
29 Oct 2011
(This was originally posted on a private mailing list. It would be good to update it or incorporate parts of it into the main SFM cleanup page instead.)
Dear import specialists,
Beth documented and sent out her list of steps to use when importing into FLEx, and she invited us all to do the same. So, I thought I'd include my process for those of you who are interested. I like to use Solid for this, which is a great way to gradually improve the quality of SFM data in a sort of upward spiral. One nice thing is that there isn't usually a specific order in which to do the cleanup, except with regard to a few steps (e.g. when to use a particular CC table, and when to run a Quick Fix such as “make inferred markers real”).
But first, I should mention one structural dilemma: the FLEx importer has trouble with the position of \ps in the MDF hierarchy. In MDF, a single \ps can sit above one or more \sn fields, whereas the FLEx database is set up with what amounts to one and only one \ps below every \sn.
When translating structures, the importer guesses well most of the time, but if you want it to get the categories 100% right, you'll want a one-to-one ratio between \ps and \sn. Otherwise, any sense whose \ps field doesn't exist or is empty will probably get assigned the \ps of the following sense. At present, I'd recommend doing so in this way, using the standard MDF structure:
(A)
\lx
\se
\ps n
\sn
\de
\ps n
\sn
\de
Ken Z made a really helpful CC table for doing this (“Add missing ps fields.cc”), and I've attached it here. No guarantees, of course, but it worked well for me. More details below.
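For anyone who would rather script this than use CC, here is a minimal Python sketch of the same idea. It is not Ken's CC table and I haven't compared the two; it only illustrates one way to get a \ps above every \sn. The function name add_missing_ps is made up for this example, and it assumes that a sense with no \ps of its own should get a copy of the most recent \ps in the same \lx or \se (or an empty \ps if there isn't one).

```python
import re

MARKER = re.compile(r"^\\(\S+)")

def add_missing_ps(lines):
    """Return a copy of lines with a \\ps line above every \\sn (structure A).

    Assumption for this sketch: a sense without its own \\ps inherits the
    most recent \\ps of the current \\lx or \\se; if there is none, an
    empty \\ps is inserted so it can be filled in later.
    """
    out = []
    last_ps = None       # most recent \ps line in the current \lx / \se
    ps_waiting = False   # True if a \ps has been seen since the last \sn
    for line in lines:
        m = MARKER.match(line)
        marker = m.group(1) if m else None   # None for continuation lines
        if marker in ("lx", "se"):
            # A new entry or subentry starts a fresh scope.
            last_ps = None
            ps_waiting = False
        elif marker == "ps":
            last_ps = line
            ps_waiting = True
        elif marker == "sn":
            if not ps_waiting:
                out.append(last_ps if last_ps is not None else "\\ps")
            ps_waiting = False
        out.append(line)
    return out

if __name__ == "__main__":
    sample = ["\\lx jump", "\\ps v", "\\sn 1", "\\de leap", "\\sn 2", "\\de skip"]
    for row in add_missing_ps(sample):
        print(row)
```

The lines are expected without trailing newlines (e.g. from text.splitlines()). As with the CC table, compare the before and after files before trusting the result.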
The other option is to invert the order of \sn and \ps to more closely resemble FLEx, as follows (note that the order of sibling fields doesn't matter):
(B)
\lx
\se
\sn
\de
\ps n
\sn
\ps n
\de
Maybe a CC guru could handle this conversion, but I think it's more complicated. Pretty soon, there will be a quick fix in Solid to “push \ps down under subsequent \sn's”, at which point some people may find that more convenient than the CC table.
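To make the (A)-to-(B) conversion concrete, here is a rough Python sketch of what "pushing \ps down" could look like. It is not the Solid quick fix (which I haven't seen) and the function name push_ps_down is hypothetical. It assumes each \ps is a single line and that missing \sn markers have already been made real, so every \ps is followed by at least one \sn.

```python
import re

MARKER = re.compile(r"^\\(\S+)")

def push_ps_down(lines):
    """Move each \\ps so that it follows every \\sn it governs (structure B).

    Assumptions for this sketch: \\ps fields are one line each, and a \\ps
    applies to all following \\sn fields until the next \\ps, \\se or \\lx.
    A \\ps that never meets an \\sn is written out again rather than dropped.
    """
    out = []
    current_ps = None   # the \ps line governing the senses that follow
    placed = True       # has current_ps been written out at least once?
    for line in lines:
        m = MARKER.match(line)
        marker = m.group(1) if m else None
        if marker == "ps":
            if not placed:
                out.append(current_ps)   # earlier \ps had no \sn; keep it
            current_ps, placed = line, False
            continue                     # hold this \ps until its \sn
        if marker in ("lx", "se"):
            if not placed:
                out.append(current_ps)
            current_ps, placed = None, True
        out.append(line)
        if marker == "sn" and current_ps is not None:
            out.append(current_ps)       # \ps now sits under this \sn
            placed = True
    if not placed:
        out.append(current_ps)
    return out
```

A \ps shared by several senses gets copied under each of them, which is the one-to-one ratio the importer wants.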
Okay, with that said, here are my steps:
- Open the file in Solid using the standard “MDF Unicode” template.
- Eliminate all flagged errors, either by editing the SFM file or by relaxing the specified rules/hierarchy. (When possible, avoid inferring markers, since this can mask errors.)
- Gradually tighten up the rules/hierarchy. You want to make your structure as strict as you reasonably can.
- Run the “make inferred markers real” Quick Fix to insert missing fields like \sn, \rf, etc. But first make sure no errors are showing. (Otherwise, Solid may infer things in the wrong places.)
- Assuming structure (A) above: With no errors showing in Solid, exit Solid and run the CC table to add \ps fields. (To verify in Solid, check that ps count = sn count, and also change sn to only be allowed to occur once per ps.)
- We now have enough \ps fields, but the empty ones will still get ignored by FLEx. So, in a tool such as Notepad++*, use a regular-expression search and replace to fill in the empty \ps fields. E.g. to fill them with “zz”:
Replace All \\ps $ With \ps zz
- If you significantly relaxed the structure earlier (in the top-left pane), you may want to undo those changes, or start over by deleting/renaming your .solid file and reopening the SFM file. You may want to switch to the “FLEx-Friendly” template, which is very strict and also flags all non-FLEx fields as invalid by default. (You'll want to import those into custom fields.) Note that this template uses structure (B) above, so you may need to adjust the hierarchy around \ps and \sn, but do avoid relaxing the template's strictness.
- If you've used fields at multiple levels, split them into multiple fields. E.g. split \nt into \nt_lx, \nt_se, \nt_sn (and maybe \nt_rf). This way you can assign each field to a different destination in FLEx. You can also tighten up your Solid settings. (A CC table for splitting \cf into \cf and \cf_se is attached.) Exception: some fields (ph, et, eg, es, ec) import just fine into both Lexeme and Subentry without being split first.
- If you've created variants of subentries by nesting a \va field under \se, there's an importer issue at present. Avoid it by removing those \va fields and replacing them with \lx entries pointing back to the \se from the other direction (via an \mn field).
E.g. replace this:
\lx jump
\se jumper
\va jumpre
with this:
\lx jump
\se jumper
\lx jumpre
\mn jumper
- Either import your parts of speech into a custom field (and later use filtering and Bulk Edit to fill in the categories), or else make sure that the contents of your \ps field exactly match the ABBREVIATIONS of your categories in FLEx. There's no need to import the \pn field.
- With no errors showing in Solid, exit Solid and import into FLEx!
- Avoid doing significant editing in FLEx until you've verified that all basic structures made it through the import. Be prepared to import multiple times, though the preceding steps MAY eliminate the need.
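The empty-\ps fill and the "ps count = sn count" check from the steps above can also be done outside an editor. The following is only a sketch in Python, with made-up function names (fill_empty_ps, count_marker). It assumes the whole SFM file has been read into one string in text mode, so line endings are plain newlines.

```python
import re

# A \ps marker followed only by spaces or tabs up to the end of the line.
EMPTY_PS = re.compile(r"^\\ps[ \t]*$", re.MULTILINE)

def fill_empty_ps(text, filler="zz"):
    """Give every empty \\ps field the placeholder value in filler."""
    # A function is used as the replacement so that the backslash in
    # "\ps" is not interpreted as a regex escape in the replacement string.
    return EMPTY_PS.sub(lambda m: "\\ps " + filler, text)

def count_marker(text, marker):
    """Count the lines that start with the given marker (e.g. 'ps')."""
    pattern = r"^\\" + re.escape(marker) + r"(?=\s|$)"
    return len(re.findall(pattern, text, re.MULTILINE))

if __name__ == "__main__":
    sample = "\\lx a\n\\ps \n\\sn\n\\de x\n\\ps n\n\\sn\n\\de y\n"
    fixed = fill_empty_ps(sample)
    print(fixed)
    print("ps:", count_marker(fixed, "ps"), "sn:", count_marker(fixed, "sn"))
```

Unlike the editor pattern above, this one also catches a \ps with no trailing space at all. If the two counts differ, some sense is still missing its \ps (or has more than one).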
If you decided to go with structure (B) above, replace the fifth step above (running the CC table to add \ps fields) as follows:
5B. With no errors showing in Solid, run the Quick Fix to “push \ps down to subsequent \sn's”. Immediately switch to the “FLEx-Friendly” template.
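For the field-splitting step (e.g. \nt into \nt_lx, \nt_se, \nt_sn and \nt_rf), a script can rename each field according to the level it occurs at. This is just an illustrative Python sketch, not the attached CC table; the function name split_by_level is hypothetical, and "level" is simplified here to mean the most recent \lx, \se, \sn or \rf marker, which may not match your Solid hierarchy in every case.

```python
import re

LEVELS = ("lx", "se", "sn", "rf")
FIELD = re.compile(r"^\\(\S+)(.*)$", re.DOTALL)

def split_by_level(lines, field="nt"):
    """Rename \\nt to \\nt_lx, \\nt_se, \\nt_sn or \\nt_rf by position.

    Simplifying assumption: a field belongs to whichever of \\lx, \\se,
    \\sn, \\rf appeared most recently above it.
    """
    out = []
    level = None
    for line in lines:
        m = FIELD.match(line)
        if m:
            marker, rest = m.group(1), m.group(2)
            if marker in LEVELS:
                level = marker
            elif marker == field and level is not None:
                line = "\\" + field + "_" + level + rest
        out.append(line)
    return out

if __name__ == "__main__":
    sample = ["\\lx jump", "\\nt irregular", "\\sn 1", "\\nt rare", "\\se jumper", "\\nt slang"]
    for row in split_by_level(sample):
        print(row)
```

Passing field="cf" would split \cf the same way, though it would name the top-level field \cf_lx rather than leaving it as \cf.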
After each step, I make a backup, and after running a CC table or Quick Fix, I compare the before and after files using WinMerge*. This helps to catch problems before they snowball.
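If you are scripting any of the steps, the same before/after check can be done with Python's standard difflib module. This is only a sketch and no substitute for WinMerge's side-by-side view; the function name diff_lines is made up for the example.

```python
import difflib

def diff_lines(before, after, before_name="before", after_name="after"):
    """Return a unified diff of two lists of lines (without newlines)."""
    return list(difflib.unified_diff(
        before, after,
        fromfile=before_name, tofile=after_name,
        lineterm="",
    ))

if __name__ == "__main__":
    before = ["\\lx a", "\\sn", "\\de x"]
    after = ["\\lx a", "\\ps", "\\sn", "\\de x"]
    for row in diff_lines(before, after):
        print(row)
```

Lines starting with "+" were added and lines starting with "-" were removed. After a step that should only add fields, any "-" line (other than the "---" header) deserves a closer look.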
Happy importing,
Jon C (Philippines / Indonesia)
* free open-source software; available at portableapps.com (and other places too). Notepad++ is similar to NoteTab in its tabbed interface and powerful “Replace All in All Open Documents” feature. It can also manage some of the simpler Unicode conversions quite easily. WinMerge is also great at comparing FOLDERS to each other and manually syncing them, i.e. you can use it to manually make and update backups if you like.