Bioinformatics is rife with formats and parsers for those formats.
These parsers don't always agree on the definitions of these formats, since many lack any sort of formal standard.
This repository aims to consolidate a collection of format specimens, forming a unified file set for testing software. Testing against the same cases is a first step towards agreeing on the details and edge cases of a format.
Unlike its predecessor BioFmtSpeciments, FormatSpecimens is version controlled and released to a julia package registry, and features a small julia module to assist in unit-testing.
FormatSpecimens is built primarily for BioJulia, and is maintained with compatibility with the BioJulia ecosystem of tools, and BioJulia developers in mind. FormatSpecimens is made available to install through BioJulia's package registry.
Julia by default only watches the "General" package registry, so before you start, you should add the BioJulia package registry.
Start a julia terminal, hit the
] key to enter pkg mode (you should see the
prompt change from
pkg> ), then enter the following command:
registry add https://github.com/BioJulia/BioJuliaRegistry.git
After you've added the registry, you can install FormatSpecimens from the julia
] to enter pkg mode again, and enter the following:
This repository consists of a directory for every major format. Directories
contain format specimens along with a file
index.toml. This is a
index.toml contains two arrays, called
invalid. All the
index records for specimen files that are considered valid (i.e. conform to the
format definition) are found in this array.
All the index records for specimen files that are considered invalid (i.e.
violate the format definition in some way) are found in the
Every entry in the
invalid arrays have the following fields:
- filename Specimen filename (required).
- origin (Optional) The contributor or source from which a specimen was taken.
- tags (Optional) One or more words used to group specimens by shared features.
- comments (Optional) Any additional information that might be of interest.
Really the only field absolutely required to retrieve a file using the
FormatSpecimens julia module is filename, but the other fields are useful to
manipulate lists of specimen files in your unit tests.
To get a list of all valid or invalid file specimens for a given format, you can do the following:
goodfiles = list_valid_specimens("FASTQ")
badfiles = list_invalid_specimens("FASTQ")
You can test if a specimen in the list has a given tag, or get an attribute like so:
# Test if the first entry in the list of goodfiles has the tag "dna" in it's
# list of tags...
# Get the comments associated with an entry:
# Get the full path of a file in the entry:
fp = joinpath(path_of_format("FASTQ"), filename(entry))
You can also use do notation in order to filter the records e.g. to list all the valid FASTA files that are of a DNA sequence you can filter by tag:
gooddnafiles = list_valid_specimens("FASTA") do x