ASI Filenames

This document contains a specification of ASI data filenames used in the NORSTAR project.  It is based on an earlier discussion document which elicited many helpful comments from Trond, Mikko, and Eric.

Version 1.0

The original convention for naming data files was implemented by Trond Trondsen for the first NORSTAR imagers.  File names have the structure:

    <site><yyyymmdd>_<hhmmss>_<filter>[_<optional>].<extension>

where <site> is an uppercase four letter code indicating the instrument location.  Date and time are provided by <yyyymmdd> and <hhmmss>.  The file name ends with information about the filter used for each image, optionally followed by further information indicating dark or calibration frames, and finished by a file extension indicating the type of image file.  An example of a complete file name is:  

    GILL20011223_230143_6300_DARK.png

and a regular expression match is:

    ([A-Z]{4})([0-9]{8})_([0-9]{6})_([0-9_A-Z]{3,4})_?([A-Z]*)\.(.*)$


where parentheses allow for sub-expression matching and extraction.
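As a minimal sketch of how these sub-expressions might be extracted in practice, the pattern above can be used unchanged from Python.  The function name parse_v10 and the field labels are illustrative choices, not part of the convention.

```python
import re

# Version 1.0 pattern, copied verbatim from the text above.
V10_PATTERN = re.compile(
    r"([A-Z]{4})([0-9]{8})_([0-9]{6})_([0-9_A-Z]{3,4})_?([A-Z]*)\.(.*)$"
)

# Field labels are hypothetical names chosen for this sketch.
V10_FIELDS = ("site", "date", "time", "filter", "optional", "extension")


def parse_v10(filename):
    """Return a dict of version 1.0 fields, or None if the name does not match."""
    match = V10_PATTERN.match(filename)
    if match is None:
        return None
    return dict(zip(V10_FIELDS, match.groups()))


if __name__ == "__main__":
    print(parse_v10("GILL20011223_230143_6300_DARK.png"))
```

For the example name above this gives site GILL, date 20011223, time 230143, filter 6300, optional field DARK and extension png.  Note that because the filter class includes the underscore, a three-character code followed by an optional field (such as NIR_DARK) is captured as NIR_ rather than NIR.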

Some examples of field values are:
  • Site codes: GILL, RANK, RESU
  • Filter codes: 4378, 5577, 6300, 0000, ____, NIR
  • Optional fields: DARK, CAL
  • Extensions: png, pgm, pgm.gz

As images are acquired they are placed in directories using the convention

    /data/<site>_<yyyymmdd>/ut<hh>/

so that each hourly sub-directory may contain several hundred files.  Data is stored in Calgary using additional directory layers of the form

    norstar/<site>/<yyyymm>/<site>_<yyyymmdd>/ut<hh>/

in order to produce manageable chunks of data.
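The two layouts are simple enough to build with string formatting.  This sketch assumes the site, date and hour are already available as strings; the function names are hypothetical.

```python
def field_dir(site, yyyymmdd, hh):
    """Acquisition directory: /data/<site>_<yyyymmdd>/ut<hh>/"""
    return f"/data/{site}_{yyyymmdd}/ut{hh}/"


def archive_dir(site, yyyymmdd, hh):
    """Calgary archive directory, with the extra <site>/<yyyymm> layers."""
    yyyymm = yyyymmdd[:6]  # first six digits of the date field
    return f"norstar/{site}/{yyyymm}/{site}_{yyyymmdd}/ut{hh}/"


if __name__ == "__main__":
    print(field_dir("GILL", "20011223", "23"))
    print(archive_dir("GILL", "20011223", "23"))
```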

This method of naming files has worked very well.  It ensures that files produced by a single instrument at each site are unique, and useful information about an image can be obtained directly from the filename.

However, there are some quirks which can cause confusion, and the potential for future difficulties as the project expands.  One minor problem is that most, but not all, fields are separated by underscores.  This complicates the task of splitting up a filename to extract information. Another minor issue is that empty (open) filters are denoted variously by "0000" and "____".  Perhaps more importantly, this naming convention does not allow for the possibility of multiple imagers at the same site.

Filter surprises

Including filter information in the filename makes it easy to select sequences of images at the same wavelength.  As it turns out, the last field is arguably the best location for it.  This is because the filter code (e.g. 4861 or NIR) may need to be followed by additional information (such as DARK to indicate that the shutter is closed).  Rather than simply append the information about dark (or possibly calibration) frames to the filter code, it is better to allow for an optional final field.  Filter codes currently in use are either 3 or 4 characters long, and may be letters or numbers.  Extending this to between 1 and 6 characters should provide enough room for future requirements.  The regular expression match for the filter and optional field would thus be [0-9A-Z]{1,6}_?[A-Z]{0,6}.

As noted by both Trond and Mikko, proper dark frames don't really depend on which filter was selected.  Hence, the filter code could simply be replaced by DARK, rather than appending it.  However, in the current operating setup each filter may have different exposure times, gains, and binning parameters.  Consequently, the dark frames collected for each filter may be very different.  Keeping the filter code should make it easier to match up the appropriate dark frame with each set of images.

Note: keep the current strategy of appending _DARK after filter codes.

At some sites, four underscores (____) have been used to indicate the open field (no filter).  Others use four zeros (0000).  Different conventions are a bit confusing, and the use of underscores conflicts with the goal of using them as unique field separators.

Note: all filter codes should be 1-8 alphanumeric characters.  Any occurrences of ____ or 0000 should be replaced with none.

Instrument names

The version 1.0 file name does not currently contain information about the specific instrument used to collect each image.  This is not a problem if there is only ever one ASI at each site.  In this case, the site code and date are sufficient to determine which instrumental characteristics are relevant for image processing or transformation.  The instrument information should also be contained within the file header.

There is, however, the potential for trouble if two or more instruments are operated simultaneously at the same location.  This has happened already, with Wilbur and Polaris at Gillam, and the two PoCa cameras at Eureka.  In the future it is possible that Themis imagers may be operated beside the NORSTAR instruments.  If images may be acquired at the same time and with the same filter code then they must always be kept in separate directories or data will be lost.


Version 1.1

This convention addresses many limitations of version 1.0.  Underscores are used to separate all fields and the instrument name is inserted between the time and filter fields.  The structure is now

    <site>_<yyyymmdd>_<hhmmss>_<instrument>_<filter>[_<optional>].<extension>
 
and the regular expression match is:

    ([A-Z0-9]+)_([0-9]{8})_([0-9]{6})_([A-Z0-9]+)_([0-9A-Z]+)_?([A-Z]*)\.(.*)$

Note that this explicitly allows site codes with more than 4 characters (or numbers).  However, the use of overly long site or instrument names is discouraged in order to keep filenames to a reasonable length.  Underscores should only occur as field separators, so all instances of ____ must be replaced with 0000.  Other filter codes, optional fields, and extensions should remain the same.  Instrument codes should be based on the names given in Table 1.  As the project expands it may be necessary to move to less creative names such as ASI01, ASI02, etc.
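A conversion from version 1.0 to version 1.1 names might look roughly like the following sketch.  The instrument code has to be supplied by the caller, since version 1.0 names do not carry it.  The function name v10_to_v11, the upper-casing of the instrument code (to satisfy the pattern as written), and the stripping of a trailing underscore from the filter field are all assumptions of this sketch.

```python
import re

# Patterns copied from the version 1.0 and version 1.1 descriptions.
V10_PATTERN = re.compile(
    r"([A-Z]{4})([0-9]{8})_([0-9]{6})_([0-9_A-Z]{3,4})_?([A-Z]*)\.(.*)$"
)
V11_PATTERN = re.compile(
    r"([A-Z0-9]+)_([0-9]{8})_([0-9]{6})_([A-Z0-9]+)_([0-9A-Z]+)_?([A-Z]*)\.(.*)$"
)


def v10_to_v11(filename, instrument):
    """Build a version 1.1 name from a version 1.0 name and an instrument code."""
    match = V10_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a version 1.0 file name: {filename!r}")
    site, date, time, filt, optional, extension = match.groups()
    if filt == "____":
        # Underscores may only separate fields in version 1.1.
        filt = "0000"
    else:
        # The v1.0 filter class can swallow a trailing separator (e.g. "NIR_").
        filt = filt.rstrip("_")
    fields = [site, date, time, instrument.upper(), filt]
    if optional:
        fields.append(optional)
    return "_".join(fields) + "." + extension


if __name__ == "__main__":
    print(v10_to_v11("GILL20011223_230143_6300_DARK.png", "Polaris"))
```

With the earlier example name and the instrument Polaris this produces GILL_20011223_230143_POLARIS_6300_DARK.png, which matches the version 1.1 pattern.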



Table 1: List of ASI sites, codes, instrument and project names.

    Site Name               Site Code   Instruments                  Projects
    Gillam, Manitoba        GILL        Wilbur, PolarisX, Polaris    CANOPUS, NORSTAR
    Rankin Inlet, Nunavut   RANK        Aqsaniq                      NORSTAR
    Resolute Bay, Nunavut   RESU        Taqqiila                     NORSTAR
    Eureka, Nunavut         EURE        Poca0, Poca1                 CNSR, NORSTAR
    Athabasca, Alberta      ATHA        Wilbur                       NORSTAR




This updated convention produces unique and informative file names.  However, they are extremely long, and certain fields (such as year and month) also appear several times in the fully qualified filename.  

Version 2.0


The conventions used in v1.0 and v1.1 implicitly assume that filenames should be unique.  This is accomplished by placing a large amount of information in the filename, which of course tends to produce very long filenames.  An alternative is to take advantage of the directory tree as suggested here:
 

    <site>/<yyyy>/<mm>/<dd>/ut<hh>/<mmss>_<instrument>_<filter>[_<optional>].<extension>

or even possibly:

    <site>/<yyyy>/<mm>/<dd>/<instrument>_<filter>/ut<hh>/<mmss>[_<optional>].<extension>

in which case the regular expression matches are:

    ([A-Z0-9]+)[/\\]([0-9]{4})[/\\]([0-9]{2})[/\\]([0-9]{2})[/\\]([A-Z0-9]+)_([0-9A-Z]+)[/\\]ut([0-9]{2})[/\\]

for the path and:

    ([0-9]{4})_?([A-Z]*)\.(.*)$

for the filename alone. An example would be:

    GILL/2002/11/23/poca0_630nm/ut03/2308_dark.png

for the Poca #0 camera collecting a dark frame at the 630nm filter setting at 03:23:08 UT on November 23 2002 at Gillam.

This eliminates redundancy between the directory path and file names and consequently produces much shorter file names.  Another advantage is that creating a catalog of hourly coverage for each filter does not require listing large numbers of files.  However, the lack of uniqueness means that files from different directories cannot be moved into a common area without the possibility of something being overwritten.  Note that each file will have a header containing all the information required to identify the correct directory location, or to allow a unique long (version 1.1) filename to be created.
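While a file is still in its version 2.0 location, the same long name can also be rebuilt from the path alone.  The sketch below does this for the second (<instrument>_<filter> directory) layout.  It is only an illustration: the function name v20_to_v11 is hypothetical, the pattern is anchored on a directory separator so that it can be used with a search, and case is ignored so that the lower-case example path is accepted.

```python
import re

# Path plus file name for the second version 2.0 layout.  Anchoring the site
# code on a separator and ignoring case are assumptions of this sketch.
V20_PATTERN = re.compile(
    r"(?:^|[/\\])([A-Z0-9]+)[/\\]([0-9]{4})[/\\]([0-9]{2})[/\\]([0-9]{2})"
    r"[/\\]([A-Z0-9]+)_([0-9A-Z]+)[/\\]ut([0-9]{2})[/\\]"
    r"([0-9]{4})_?([A-Z]*)\.(.*)$",
    re.IGNORECASE,
)


def v20_to_v11(path):
    """Rebuild a unique version 1.1 file name from a version 2.0 path."""
    match = V20_PATTERN.search(path)
    if match is None:
        raise ValueError(f"not a version 2.0 path: {path!r}")
    site, yyyy, mm, dd, instrument, filt, hh, mmss, optional, ext = match.groups()
    fields = [site, yyyy + mm + dd, hh + mmss, instrument, filt]
    if optional:
        fields.append(optional)
    # Upper-case the information fields; leave the extension untouched.
    return "_".join(fields).upper() + "." + ext


if __name__ == "__main__":
    print(v20_to_v11("GILL/2002/11/23/poca0_630nm/ut03/2308_dark.png"))
```

For the example path above this yields GILL_20021123_032308_POCA0_630NM_DARK.png.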

Care must be taken when using non-unique file names.  If they are produced in the field, then software must first check that the given filename does not already exist (even if that is "impossible").  In the case of a naming collision, some unique string (such as the current Unix system time) should be appended to create a new file name, so that files are never overwritten.

The <instrument>_<filter> directory level is a bit different from the others, in that it groups two different information fields.  This allows tricks like this:

[clearwater]/arena/norstar/raw : ls EURE/2002/11/23/*630*
./     ../    ut01/  ut02/  ut03/


to quickly identify what hours of data exist for any kind of filter.
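Returning to the collision rule described above, one way to make sure a file is never overwritten is to let the operating system refuse to open an existing file, and only then fall back to a renamed file.  This is a sketch under stated assumptions: the function name is hypothetical, and placing the Unix time as an extra underscore-separated field before the first dot (with a counter for repeated clashes) is just one possible choice of "unique string".

```python
import os
import time


def write_without_overwrite(path, data):
    """Write bytes to path, renaming on collision so nothing is overwritten.

    Returns the path that was actually written.
    """
    directory, name = os.path.split(path)
    body, dot, extension = name.partition(".")  # split at the first dot
    candidate = path
    attempt = 0
    while True:
        try:
            # Mode "xb" raises FileExistsError instead of truncating a file.
            with open(candidate, "xb") as handle:
                handle.write(data)
            return candidate
        except FileExistsError:
            # Assumed suffix: Unix time, plus a counter for repeat clashes.
            suffix = str(int(time.time()))
            if attempt:
                suffix += str(attempt)
            candidate = os.path.join(directory, f"{body}_{suffix}{dot}{extension}")
            attempt += 1
```

Using exclusive creation rather than a separate "does it exist?" check avoids the small window in which another process could create the file between the check and the write.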

My plans are to convert all files on Twoface from v1.0 to v2.0 and see how that works.  If we need to move files around then v1.1 is probably the best naming convention.  As new site software is fielded we should definitely transition to v1.1 so that instruments like PoCa can function properly.


Additional Comments


Underscores separate fields

Both automatic and human parsing of file names are simplified if fields of information are separated by some unique character.  The underscore (_) is a common choice, and is the separator used in our current format.

Note: Some of the early Gillam Polaris data uses four underscores (____) to denote an empty filter.  Beware.

Extensions are special

Using file name extensions to denote file types is common practice.  Several different types of image (data) files have been used so far in the NORSTAR project.  The most common is the Portable Network Graphics format, which uses a .png extension.  Data from Gillam has also been stored as Portable Anymap (PNM) files, which have a .pnm extension.  These may have been compressed using the standard GZIP algorithm to produce files with a .pnm.gz extension.  In the future there may well be other modifications to the image file type.  We should continue to use the period (".") as a separator.  All parts of the file name before the first "dot" will be information about the image itself, while everything after indicates the format in which the image data is stored.
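The "first dot" rule is easy to apply, but it differs from what some library helpers do.  As a small sketch (the function name split_name is hypothetical):

```python
def split_name(filename):
    """Split at the FIRST dot: (image information, storage format)."""
    body, _, extension = filename.partition(".")
    return body, extension


if __name__ == "__main__":
    print(split_name("GILL20011223_230143_6300_DARK.pnm.gz"))
```

For a compressed file this keeps pnm.gz together as the format.  By contrast, Python's os.path.splitext splits at the last dot and would report only .gz as the extension.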

Case insensitivity

Although case doesn't really matter, it would be nice to have some consistency.  Also, strange things can happen to case when files are moved between operating systems.  The typical *nix default is to use lower case letters, with upper case reserved for "shouting".  However, we're currently using upper case for everything (e.g. GILL and NIR).  Fewer changes are required to simply continue in this vein.

Note: Assume that all site codes (e.g. GILL) will be upper case.  Allow for mixed case instrument and filter names.  This may require that some regular expressions be modified (e.g. [A-Z] changed to [a-zA-Z]).


Site codes are (at least) 4 characters long

The current convention is to use four letters of the alphabet to label each site (see Table 1).  Four letters allow 26^4 = 456,976 distinct codes, which is effectively unlimited for our purposes.  This avoids conflicts with the IAGA 3-letter site codes, and allows consistency with the CANOPUS codes.  However, there could be some confusion if observations were made from multiple nearby sites or if sites were moved by a significant amount while still remaining in the same municipality.

Note: assume that the site field may be up to six characters long.  The first four must be letters of the alphabet to remain consistent with the current format.  The next two characters are optional, and may be numbers or letters.

In terms of regular expressions, the site code must match [A-Z]{4}[A-Z0-9]{0,2}.  For example, a new site at Gillam might have the site code GILL01.  I suspect that this change may in fact never be necessary, but would rather overplan than find ourselves painted into a corner.
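A quick check of candidate site codes against this pattern could be sketched as follows; the function name is hypothetical.

```python
import re

# Four letters, optionally followed by up to two letters or digits.
SITE_CODE = re.compile(r"[A-Z]{4}[A-Z0-9]{0,2}")


def is_valid_site_code(code):
    """True if the whole string is a valid site code."""
    return SITE_CODE.fullmatch(code) is not None


if __name__ == "__main__":
    for code in ("GILL", "GILL01", "GIL", "gill", "GILL012"):
        print(code, is_valid_site_code(code))
```

Here GILL and GILL01 are accepted, while GIL (too short), gill (lower case) and GILL012 (too long) are rejected.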


Dates and times are good

The current strategy is fine.  A minimal regular expression for matching this part is [0-9]{8}_[0-9]{6}, although a more stringent version, [12][901][0-9]{2}[01][0-9][0-3][0-9]_[012][0-9][0-5][0-9][0-5][0-9], could also be used in order to check that the numbers are reasonable.  This is probably overkill.  As Mikko says: "assume that the file names are already correctly formed.  If not, then you have bigger problems..."
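If stricter checking is ever wanted, note that even the stringent pattern still accepts some impossible values (month 19 or hour 29, for example).  An alternative sketch is to use the minimal pattern for the shape and let a date library judge the values.  The function name parse_timestamp is hypothetical.

```python
import re
from datetime import datetime

# Minimal pattern from the text: eight digits, underscore, six digits.
STAMP = re.compile(r"[0-9]{8}_[0-9]{6}")


def parse_timestamp(stamp):
    """Return a datetime for <yyyymmdd>_<hhmmss>, or None if it is not valid."""
    if STAMP.fullmatch(stamp) is None:
        return None
    try:
        return datetime.strptime(stamp, "%Y%m%d_%H%M%S")
    except ValueError:
        # Right shape, but not a real date or time (e.g. month 19).
        return None


if __name__ == "__main__":
    print(parse_timestamp("20011223_230143"))
    print(parse_timestamp("20011923_230143"))
```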

Long file names

If file names are not unique, then they must be kept in separate directories to avoid files being overwritten and data lost.  The minimal standard for filenames is ISO 9660, which allows a total length of 11 characters with 8 in the body and three in the suffix.  The Joliet file system (a Microsoft extension to ISO 9660) allows the use of Unicode characters and file names of up to 64 characters in length.  There is also the "Rock Ridge" set of extensions for Unix systems which also allows for long file names.  I can't find any clear statement about the maximum file name length, except for one document that suggested 128 characters.  For now I'll assume that it's at least 64.

Version 1.0 file names are at least 23 characters long (24 with a four-character filter code), plus extension.  Clearly, 8+3 is already inadequate.  The modified format of version 1.1 would require no more than 50 characters (plus the extension) which leaves room for future changes as required while still fitting in a 64 character limit.



Changing the instrument name?

Naming each ASI individually is reasonable if there are a small number of instruments.  However, it will start to get inconvenient as we field more imagers.  Earlier instrument names such as PoCa and Wilbur are in common use and shouldn't be altered.  However, the newer instrument names (Aqsaniq, Taqqiila, Polaris) could probably be changed without too much confusion.  As Trond notes, this change is in favor of automated processing, at the expense of human readability.  I agree, but think the tradeoff will be worth it as the project becomes larger.


Any new naming scheme should group cameras by type. Some suggested names are provided in Table 2.  My previous whimsical suggestion of "betacam" wasn't too popular, so this time I've gone for the terse approach of "ASI" + instrument number.


Table 2: New Instrument Naming Scheme

    Old Name                            New Name
    Polaris (first version)             ASI00
    Aqsaniq                             ASI01
    Taqqiila                            ASI02
    Polaris (upgraded)                  ASI03
    imager from enhancement contract    ASI04
    new bare CCD imager                 ASI05, ASI06...
    Themis instruments                  THEMIS01, THEMIS02...


The regular expression match should be fairly general with a maximum length of 12 characters: [A-Z0-9]{4,12}.  The file names are already rather long, and 12 characters should be enough to uniquely describe each instrument.  Aside from the increased length, the file names are now slightly redundant, in that the instrument label and date should be enough to determine the site (i.e. the same camera can't be in two places at once).  However, it is convenient to have the site information readily available.

Change #5: change the instrument names (labels) as indicated in Table 2.

©2005 NORSTAR