A MATLAB library for protein function prediction.
MATLAB functions in this package use "pfp_" (short for Protein Function Prediction) as the file name prefix.
- ONT, the ontology structure, see `pfp_ontbuild.m`. This structure has the following fields:
  - required fields:
    - `term`: a list of term structures (`id`, `name`).
    - `rel_code`: a list of relationship codes, e.g. `{is_a, part_of}`.
    - `DAG`: the relationship matrix; `DAG(i, j) = k` (`k > 0`) means term(i) has the k-th relationship of `rel_code` with term(j).
    - `ont_type`: the ontology type, e.g. `molecular_function`.
    - `date`: the date on which this structure was built.
  - optional fields:
    - `alt_list`: the alternate term ID list.
- OA, the ontology annotation structure, see `pfp_oabuild.m`. This structure has the following fields:
  - required fields:
    - `object`: the object (sequence) list.
    - `ontology`: the associated ontology structure.
    - `annotation`: the annotation matrix; `annotation(i, j) = 1` means object(i) is annotated with term(j).
    - `date`: the date on which this structure was built.
- PRED, the prediction structure (built by any prediction method, similar to `OA`). This structure has the following fields:
  - required fields:
    - `object`: the object (sequence) list.
    - `ontology`: the associated ontology structure.
    - `score`: the predicted score matrix; `score(i, j)` is the predicted score for the association between object(i) and term(j).
    - `date`: the date on which this prediction was made.
  - optional fields (see the CAFA rules for details of these fields):
    - `author`: the author of this predictor.
    - `model`: the model number.
    - `keywords`: the keywords of this predictor.
    - `tag`: some additional information.
- NET, the network structure, see `pfp_netbuild.m`. This structure has the following fields:
  - required fields:
    - `object`: the object (node) list.
    - `ADJ`: the adjacency matrix.
    - `date`: the date on which this network was built.
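To make the field layout of these structures concrete, here is a minimal sketch that builds toy instances by hand. It is an illustration only: the IDs, names, and scores are made up, and the choices of a struct array for `term`, a full (non-sparse) `DAG`, and the `datestr` date format are assumptions rather than what `pfp_ontbuild` or `pfp_oabuild` necessarily produce.

```matlab
% Toy ontology (ONT) with three made-up terms.
ont.term      = struct('id',   {'GO:0000001', 'GO:0000002', 'GO:0000003'}, ...
                       'name', {'root', 'child A', 'child B'});
ont.rel_code  = {'is_a', 'part_of'};
ont.DAG       = zeros(3);
ont.DAG(2, 1) = 1;                   % term(2) is_a term(1)
ont.DAG(3, 1) = 2;                   % term(3) part_of term(1)
ont.ont_type  = 'molecular_function';
ont.date      = datestr(now);        % date format is an assumption

% Toy annotation (OA): two sequences annotated with terms of ont.
oa.object     = {'P00001'; 'P00002'};
oa.ontology   = ont;
oa.annotation = logical([1 1 0; 1 0 1]);
oa.date       = datestr(now);

% Toy prediction (PRED): scores for two query sequences.
pred.object   = {'Q00001'; 'Q00002'};
pred.ontology = ont;
pred.score    = [0.9 0.2 0.0; 0.1 0.0 0.7];
pred.date     = datestr(now);

% Toy network (NET): an undirected graph on two nodes.
net.object = {'P00001'; 'P00002'};
net.ADJ    = [0 1; 1 0];
net.date   = datestr(now);

% Decode every relationship stored in DAG.
[i, j, k] = find(ont.DAG);
for n = 1:numel(k)
    fprintf('%s %s %s\n', ont.term(i(n)).id, ont.rel_code{k(n)}, ont.term(j(n)).id);
end

% Terms annotated to the first object, and terms predicted for the second
% query at a hypothetical score threshold of 0.5.
annotated = {ont.term(oa.annotation(1, :)).id};
predicted = {ont.term(pred.score(2, :) >= 0.5).id};
```

Row indices of `annotation` and `score` follow `object`, and column indices follow `ontology.term`, so a logical row can be used directly to index the term list.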
- "Training" data
  - Sequences in FASTA format.
  - Annotations (MFO terms, for example) for each of these sequences. This data needs to be prepared ahead of time as a two-column, TAB-delimited text file:
        <sequence ID>\t<GO term ID>
    where `<sequence ID>` can come from any system (e.g., UniProt accession numbers), as long as the IDs are consistent with those used in the FASTA file.
- NCBI BLAST tool (version 2.2.29+ was used for this document)
- Query sequences in FASTA format.
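As a rough sketch of preparing the annotation file, the snippet below writes a few sequence/term pairs as TAB-delimited lines and reads them back as a sanity check. The sequence IDs, the pairing of IDs with terms, and the location of the file in the temporary directory are illustrative assumptions.

```matlab
% Hypothetical sequence IDs paired with GO term IDs (illustration only).
seqid = {'P00001'; 'P00001'; 'P00002'};
goid  = {'GO:0003674'; 'GO:0005488'; 'GO:0003824'};

% Write one "<sequence ID>\t<GO term ID>" pair per line.
afile = fullfile(tempdir, 'annotation.dat');
fid = fopen(afile, 'w');
for i = 1:numel(seqid)
    fprintf(fid, '%s\t%s\n', seqid{i}, goid{i});
end
fclose(fid);

% Read the file back to confirm it has exactly two TAB-separated columns.
fid = fopen(afile, 'r');
cols = textscan(fid, '%s%s', 'Delimiter', '\t');
fclose(fid);
delete(afile);
```

A sequence with several annotations simply appears on several lines, once per term.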
STEP 1: Load annotations of training sequences.
    oa = pfp_oabuild(ont, 'annotation.dat');
where `ont` is a MATLAB ontology structure, which can be built from an OBO file (say, 'ontology.obo') as
    ont = pfp_ontbuild('ontology.obo');
Note that a typical Gene Ontology OBO file contains all three GO ontologies (i.e., MFO, BPO, and CCO). In that case, `pfp_ontbuild` returns a cell of THREE ontology structures instead:
    onts = pfp_ontbuild('go.obo')
By default, they are ordered alphabetically as BPO, CCO, MFO. You can also double-check the `.ont_type` field of each returned structure.
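Rather than relying on the position in the returned cell, the ontology can be picked by its `ont_type`. The sketch below uses a hand-made stand-in for the output of `pfp_ontbuild('go.obo')` with only `ont_type` populated. The values `biological_process` and `cellular_component` are assumed by analogy with `molecular_function`.

```matlab
% Stand-in for: onts = pfp_ontbuild('go.obo');
onts = {struct('ont_type', 'biological_process'), ...
        struct('ont_type', 'cellular_component'), ...
        struct('ont_type', 'molecular_function')};

% Collect the type of each ontology, then select MFO by name.
types = cellfun(@(o) o.ont_type, onts, 'UniformOutput', false);
ont   = onts{strcmp(types, 'molecular_function')};
```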
STEP 2: Prepare BLAST results.
- Run `blastp` on the query sequences against the "training" sequences, setting the output format as follows:
      blastp ... -outfmt "6 qseqid sseqid evalue length pident nident" -out blastp.out
- Load the tabular output file (`blastp.out`, as shown above) into MATLAB:
      B = pfp_importblastp('blastp.out');
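To show what the six-column tabular file contains, the following sketch writes two made-up hit lines and parses them with `textscan`. It is not a reimplementation of `pfp_importblastp`, whose returned structure is not described here. The hit values and the temporary file location are assumptions.

```matlab
% Write two hypothetical hits in the order:
% qseqid sseqid evalue length pident nident
bfile = fullfile(tempdir, 'blastp.out');
fid = fopen(bfile, 'w');
fprintf(fid, 'Q00001\tP00001\t1e-50\t200\t95.00\t190\n');
fprintf(fid, 'Q00001\tP00002\t2e-10\t150\t40.00\t60\n');
fclose(fid);

% Parse the TAB-delimited columns: two strings followed by four numbers.
fid = fopen(bfile, 'r');
C = textscan(fid, '%s%s%f%f%f%f', 'Delimiter', '\t');
fclose(fid);
delete(bfile);

[qseqid, sseqid, evalue, len, pident, nident] = deal(C{:});
```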
STEP 3: Build the BLAST predictor.
    blast = pfp_blast(qseqid, B, oa);
where `qseqid` is a cell list of the query sequences for which you need scores (it can be just a subset of those you BLAST'ed), `B` is the structure imported in step 2, and `oa` is the ontology annotation structure loaded in step 1.
Extra options can be specified as additional arguments to this function; see the documentation of `pfp_blast.m` for more details. The result, `blast`, is the BLAST predictor in MATLAB, ready for evaluation.
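How `pfp_blast` turns hits into scores is not described here. As a hedged illustration of the general idea, the sketch below assumes one common baseline: a query's score for a term is the highest percent identity (scaled to [0, 1]) among its hits that carry that term. All variable names and values are hypothetical, and the actual function may score differently.

```matlab
% Toy inputs (hypothetical IDs and values).
train_id   = {'P00001'; 'P00002'};            % annotated "training" sequences
annotation = logical([1 1 0; 1 0 1]);         % training sequence x term
hit_q      = {'Q00001'; 'Q00001'; 'Q00002'};  % qseqid column of the hits
hit_s      = {'P00001'; 'P00002'; 'P00002'};  % sseqid column of the hits
hit_pident = [95; 40; 70];                    % pident column of the hits
qseqid     = {'Q00001'; 'Q00002'};            % queries to be scored

score = zeros(numel(qseqid), size(annotation, 2));
[found, sidx] = ismember(hit_s, train_id);    % map hits to training rows
for h = find(found)'
    q = find(strcmp(qseqid, hit_q{h}));
    if isempty(q), continue; end              % hit for a query we do not score
    terms = annotation(sidx(h), :);           % terms carried by this hit
    score(q, terms) = max(score(q, terms), hit_pident(h) / 100);
end
```

Queries without any hit against an annotated training sequence keep an all-zero row under this assumption.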
The GOtcha predictor can be built in the same way as the BLAST predictor:
    gotcha = pfp_gotcha(qseqid, B, oa);
To build a naive predictor, all you need is the ontology annotation structure `oa` from step 1 of making a BLAST predictor. Then run the following in MATLAB:
    naive = pfp_naive(qseqid, oa);
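The scoring rule used by `pfp_naive` is not spelled out here. A naive baseline is often taken to be the frequency of each term among the annotated training sequences, assigned identically to every query; the sketch below shows that assumption on toy data and should not be read as the function's actual implementation.

```matlab
% Toy annotation matrix (training sequence x term) and hypothetical queries.
annotation = logical([1 1 0; 1 0 1; 1 0 0]);
qseqid     = {'Q00001'; 'Q00002'};

% Fraction of training sequences annotated with each term.
freq = mean(double(annotation), 1);

% Every query receives the same row of term frequencies.
score = repmat(freq, numel(qseqid), 1);
```

This also shows why the naive predictor needs only `oa` and no BLAST results: the scores do not depend on the query sequence.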