Welcome to the MGFunc Wiki
MGFunc is an automated pipeline for annotating metagenomic gene catalogues using remote-homology-based functional annotation and phylogenetic analysis. It provides a protein-level workflow for processing the large metagenomic datasets produced by high-throughput sequencing experiments. MGFunc offers researchers a customizable tool to identify, cluster and analyze the protein space. The main input to the pipeline is a set of unknown protein sequences, and the output is a list of annotated ortholog clusters and novel ortholog clusters that are presumably functional homologs.
This wiki page guides users through acquiring and using the pipeline.
MGFunc is distributed as a Python package that can be downloaded from the GitHub repository as a zip archive. Extract the archive on your own computer or server, make sure the system requirements and dependencies are met, and you can start using the command line tools right away.
MGFunc is written in Python 2.7 and Bash, so it can run on any platform that supports Python 2.7, Bash and the pipeline dependencies.
Depending on the size of the input data, the required disk space can vary from 100 to 250 GB.
As MGFunc uses computationally intensive programs and algorithms, we implemented threading in some of the pipeline scripts. While a single processor can run the pipeline, multiple processors are recommended to decrease the total runtime.
MGFunc uses both external programs and Python libraries.
Python libraries that are used in MGFunc are listed in Python Libraries.
Optional: An external, customized Python tool, goatools, is used. The original version can be found in the GitHub repository.
Architecture and Implementation
MGFunc is divided into 6 conceptual modules, each of which consists of several sections. The workflow and the structure can be seen in Figure 1.
MGFunc is organized as a Python package consisting of several scripts, a master script and a config-file. This way the pipeline can be run as a single automated tool. Most scripts are also command line tools that can be run standalone. The different scripts and their relations to sections and modules can be seen in Figure 1.
In this documentation, the main options and all sections of the configuration file are explained, and guidelines for changing the options are provided.
To run the pipeline as one tool, see the help for the command line options of the main script, either by passing the -h option on the command line or here: MGFunc.v2.py help.
To run the scripts separately, see help files for each script (Help files) and tutorials for individual examples. [link to Tutorials]
MGFunc works with a configuration file where you can set the different options for each section.
The configuration file uses the INI format so that it can be parsed in the main script (MGFunc.py) using Python’s ConfigParser module. This format consists of section names in square brackets (e.g. [beginning]), with the options following the section header. Option names may or may not start with a dash (“-”) and can be a letter, a word or an abbreviation. Anything between a section header and the next section header, or the end of the file, is considered an option of that section. While most options in a section correspond to the options of the individual scripts, some are used by the pipeline to control the sequence of events. One of these is the “-run” option, which is found in Sections 2-14.
Once the important sections in the configuration file are prepared, the pipeline can be run as one sequence of events. The package comes with a default configuration file.
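As a minimal sketch of the format described above, the following snippet parses a small INI text with dash-prefixed and plain option names. The section names other than [beginning], all option names except -run and -v, and the use of "=" as the delimiter are assumptions made for illustration; they are not taken from the shipped MGFunc.ini. The import fallback is only there so the sketch runs under both Python 2.7 and Python 3.

```python
import io

try:
    import configparser                    # Python 3 module name
except ImportError:
    import ConfigParser as configparser    # Python 2.7 module name

# Hypothetical config text: names and values are illustrative only.
EXAMPLE_INI = u"""\
[beginning]
-v = True
workdir = /path/to/MGFunc

[clustering]
-run = False
-t = 4
"""


def load_config(text):
    """Parse INI text and return the parser object."""
    parser = configparser.RawConfigParser()
    # read_file() is the Python 3 name, readfp() the Python 2.7 name.
    read = getattr(parser, "read_file", None) or parser.readfp
    read(io.StringIO(text))
    return parser


config = load_config(EXAMPLE_INI)
print(config.sections())                # section headers, in file order
print(config.get("clustering", "-t"))   # option values are read as strings
```

Everything between one header and the next ends up under that header, which is why an option such as -run can appear in many sections without clashing.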
Boolean options found in all or most sections:
- -v: Increase verbosity. Prints out all options and specific sections from the scripts. (True/False)
- -run: Run the section or not. (True/False)
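To illustrate how a True/False option such as -run can drive the sequence of events, here is a hedged sketch that lists the sections whose -run option is enabled. The section names are hypothetical, and treating a section without -run as disabled is an assumption of this sketch; the actual main script may handle that case differently.

```python
import io

try:
    import configparser                    # Python 3 module name
except ImportError:
    import ConfigParser as configparser    # Python 2.7 module name

# Hypothetical sections; only the -run and -v option names come from the text.
EXAMPLE_INI = u"""\
[blast]
-run = True
-v = False

[clustering]
-run = False
-v = True

[annotation]
-run = True
"""


def sections_to_run(parser):
    """Return the names of sections whose -run option is true, in file order."""
    selected = []
    for name in parser.sections():
        # Assumption: a section without -run is skipped.
        if parser.has_option(name, "-run") and parser.getboolean(name, "-run"):
            selected.append(name)
    return selected


parser = configparser.RawConfigParser()
read = getattr(parser, "read_file", None) or parser.readfp
read(io.StringIO(EXAMPLE_INI))
print(sections_to_run(parser))
```

getboolean accepts True/False case-insensitively, so the values can be written exactly as in the option descriptions above.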
The config-file is provided in the MGFunc package as MGFunc.ini. To run the pipeline, you have to adapt the first section to your own system; for example, the MGFunc directory on your system has to be specified.
- An example of the config-file can be found here: Link to the config-file
Guidelines on how to change this first section can be found in the documentation page.
If you are a beginner and want to run the pipeline with default parameters, you only need to modify the first section in module 1 of the configuration file.
Advanced users can find explanations for other sections in the documentation page.
Tutorials (in progress...)
Step-by-step guide on how to run MGFunc
Run the whole pipeline
Single genecatalog file:
python2.7 scripts/MGFunc.v2.py -g MGFunctest.fasta -c MGFunc.test.ini -d ../Uniprot/Knowledgebase/uniprot_sprot.dat.gz -v
Multiple files (e.g. samples):
The genecatalog-list and the sample files have to be in the same directory.
python2.7 scripts/MGFunc.v2.py -gl ../Koala/HKMNZ/genecatlist -c scripts/MGFunc.ini -d ../Uniprot/Knowledgebase/uniprot_sprot.dat.gz -v
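When many runs have to be launched, the two invocations above can be assembled from Python. The sketch below is only a convenience wrapper: the function names build_command and run are hypothetical, and it uses only the flags shown in the commands above (-g, -gl, -c, -d, -v). That -g and -gl are mutually exclusive is an assumption of this sketch.

```python
import subprocess


def build_command(config, database, gene_catalog=None, catalog_list=None,
                  verbose=True, script="scripts/MGFunc.v2.py",
                  python="python2.7"):
    """Build the argument list for one MGFunc run (hypothetical helper)."""
    # Assumption: exactly one of -g (single file) or -gl (list) is given.
    if (gene_catalog is None) == (catalog_list is None):
        raise ValueError("give exactly one of gene_catalog or catalog_list")
    cmd = [python, script]
    if gene_catalog is not None:
        cmd += ["-g", gene_catalog]
    else:
        cmd += ["-gl", catalog_list]
    cmd += ["-c", config, "-d", database]
    if verbose:
        cmd.append("-v")
    return cmd


def run(cmd):
    """Launch the pipeline and raise CalledProcessError if it fails."""
    subprocess.check_call(cmd)


single = build_command("MGFunc.test.ini",
                       "../Uniprot/Knowledgebase/uniprot_sprot.dat.gz",
                       gene_catalog="MGFunctest.fasta")
print(" ".join(single))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces; call run(single) to start the pipeline.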
Help files for each individual script can be found on the Help files page.