Bioinformatics has been an invaluable tool for this research. Several of the Coenzyme A (CoA) ligase genes were known at the beginning of our investigation, but identifying the full complement of ligase enzymes proved difficult. Using known CoA ligase protein sequences (from Arabidopsis and other organisms), we identified approximately 30 of the family members. Completing the set of 63 genes required repeated use of the PatMatch tool at The Arabidopsis Information Resource (TAIR) with variations on the 12-residue Box1 motif, combined with careful analysis of the sequences it returned. Finally, the characterization of JAR1 by Staswick et al. led us to recognize that the 19 genes in clade III are not CoA ligases. Each of the remaining 6 clades contains at least one CoA ligase confirmed by our assays, so we believe that all 44 remaining enzymes are CoA ligases (or mechanistically similar acyl carrier protein (ACP) ligases).

In contrast, we have identified only a subset of the CoA thioesterases. Known CoA thioesterases from other organisms belong to several different families, and there is no unifying PROSITE motif for them as there is for the ligases. Nevertheless, the 15 genes identified to date provide an excellent starting point for investigating the functions of these enzymes.
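The motif-variation searches described above can be illustrated with a minimal Python sketch. It is not the PatMatch implementation; it only shows the general idea of translating a PROSITE-style pattern into a regular expression and then relaxing one position at a time. The BOX1 string is an assumption: it is our reading of the PROSITE AMP-binding signature and should be checked against PROSITE before use. The function names and the toy sequences are hypothetical and invented for illustration.

```python
import re

# Assumption: a 12-position PROSITE-style pattern standing in for the Box1
# motif; verify against the current PROSITE entry before relying on it.
BOX1 = "[LIVMFY]-{E}-{VES}-[STG]-[STAG]-G-[ST]-[STEI]-[SG]-x-[PASLIVM]-[KR]"


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Separate an optional repeat count such as x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":                      # any residue
            token = "."
        elif core.startswith("{"):           # excluded residues
            token = "[^" + core[1:-1] + "]"
        else:                                # [allowed set] or single residue
            token = core
        if repeat:
            token += "{" + repeat + "}"
        parts.append(token)
    return "".join(parts)


def relaxed_variants(pattern):
    """Yield variants of the pattern with one position relaxed to 'x'."""
    elements = pattern.split("-")
    for i, element in enumerate(elements):
        if element != "x":
            yield "-".join(elements[:i] + ["x"] + elements[i + 1:])


def scan(sequences, pattern):
    """Return {name: [(1-based start, matched text), ...]} for each hit.

    Matches are non-overlapping, as returned by re.finditer.
    """
    regex = re.compile(prosite_to_regex(pattern))
    hits = {}
    for name, seq in sequences.items():
        found = [(m.start() + 1, m.group()) for m in regex.finditer(seq)]
        if found:
            hits[name] = found
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequences, not real Arabidopsis proteins.
    toy = {"strict_hit": "MKQIAYTSGTTGLPKGVMLT",
           "one_mismatch": "MKQIAYTSGTTGLPQGVMLT"}
    print(scan(toy, BOX1))
    for variant in relaxed_variants(BOX1):
        print(variant, sorted(scan(toy, variant)))
```

Relaxing a single position at a time keeps the candidate list small enough for the manual inspection that the search still requires; any hit found only by a relaxed variant would need to be confirmed by alignment against known family members.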

Crystal structures have been solved for four AMP-binding proteins (luciferase, PheA, DhbE, and acetyl-CoA synthetase (ACS)) with various bound ligands. Molecular modeling based on these structures has contributed important information about the acyl-CoA synthetases, the 4-coumarate-CoA ligases, and the non-CoA-ligase acyl activating enzymes (AAEs) of clade III. Computer programs that attempt to predict the subcellular targeting of proteins provide only provisional information. We are fortunate that the statistically most reliable predictions are those for peroxisomal targeting signals, particularly the C-terminal PTS1 sequence (SKL or similar).
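As a rough illustration of the PTS1 check mentioned above, the following Python sketch tests whether a protein sequence ends in SKL or a similar tripeptide. The relaxed consensus [SAC][KRH][LM] is an assumption chosen for illustration; it is not the rule used by any particular prediction server, and the function name and example sequences are hypothetical. Like the predictions themselves, a positive result here is only provisional.

```python
import re

# Assumption: "SKL or similar" is approximated by the relaxed consensus
# [SAC][KRH][LM] at the extreme C-terminus. Adjust this to the consensus
# you trust; it is not taken from the source text.
PTS1_RE = re.compile(r"[SAC][KRH][LM]$")


def has_pts1(protein_seq):
    """Return True if the sequence ends in a PTS1-like tripeptide."""
    # Normalize case and drop a trailing stop-codon marker if present.
    seq = protein_seq.strip().upper().rstrip("*")
    return bool(PTS1_RE.search(seq))


if __name__ == "__main__":
    # Hypothetical toy sequences, not real proteins.
    for name, seq in {"ends_SKL": "MAAAGGSKL",
                      "internal_SKL": "MSKLAAAGG",
                      "ends_ARL": "MAAAGGARL*"}.items():
        print(name, has_pts1(seq))
```

Anchoring the pattern to the end of the sequence matters because PTS1 is a C-terminal signal; the same tripeptide at an internal position is not counted.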

Other database information that we have already investigated or are assembling includes pooled and tissue-specific EST data, upstream promoter regions that will be analyzed for known cis-regulatory elements using the PLACE and PlantCARE servers, and microarray databases, all available through TAIR.