Bio4j has been accepted as a mentoring organization for Google Summer of Code 2014! Check the blog post.
Bio4j is a high-performance, cloud-enabled, graph-based, open source bioinformatics data platform. It integrates the data available in the most representative open data sources around protein information:
- UniProtKB (Swiss-Prot + TrEMBL)
- Gene Ontology (GO)
- UniRef (50, 90, 100)
- NCBI taxonomy
- ExPASy ENZYME
Bio4j provides a new and powerful framework for querying and managing protein-related information. Because it relies on a high-performance graph engine, data is stored in a way that semantically represents its own structure. In contrast, traditional relational databases must flatten the data they represent into tables and create artificial ids to connect the different tuples, which can in some cases lead to domain models that have almost nothing to do with the actual structure of the data.
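To make the contrast concrete, here is a minimal Python sketch. It is not Bio4j code: the accession `EXAMPLE_1`, the edge label `ANNOTATED_WITH`, and all variable and function names are illustrative assumptions. It stores the same protein-to-GO-term annotation twice, first as relational-style tables connected through a join table with artificial row ids, then as a graph in which the relationship is a direct edge.

```python
# Relational-style: entities live in separate tables, and an extra join
# table keyed by artificial row ids is needed to connect them.
proteins = {1: {"accession": "EXAMPLE_1"}}
go_terms = {10: {"go_id": "GO:0005524", "name": "ATP binding"}}
protein_go = [{"protein_row": 1, "go_row": 10}]  # join table


def go_names_relational(accession):
    # Resolve accession -> row id, then follow the join table row by row.
    rows = [rid for rid, p in proteins.items() if p["accession"] == accession]
    return [go_terms[link["go_row"]]["name"]
            for link in protein_go if link["protein_row"] in rows]


# Graph-style: nodes are addressed by their natural identifiers and the
# annotation is an explicit, labelled edge between them.
graph_nodes = {
    "EXAMPLE_1": {"type": "Protein"},
    "GO:0005524": {"type": "GoTerm", "name": "ATP binding"},
}
graph_edges = {"EXAMPLE_1": [("ANNOTATED_WITH", "GO:0005524")]}


def go_names_graph(accession):
    # Follow the outgoing edges of the protein node directly.
    return [graph_nodes[target]["name"]
            for label, target in graph_edges.get(accession, [])
            if label == "ANNOTATED_WITH"]
```

Both functions return the same answer; the difference is that the graph version never needs the surrogate ids or the intermediate join table.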
This set of slides from FOSDEM 2014 should give an up-to-date overview of the project:
First of all, Bio4j has an Abstract Domain Model, which allows you to use it without binding to a particular backend implementation.
Next, it has an intermediate Blueprints layer, which allows us to provide a default implementation of the abstract interface using the TinkerPop Blueprints API while staying independent of the choice of database technology.
And finally, there are technology specific versions:
- Titan DB implementation
- Neo4j DB implementation
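The three layers above can be sketched roughly as follows. This is a simplified Python illustration of the layering idea only, not the actual Bio4j or Blueprints API: every class and method name here (`ProteinGraph`, `GenericGraphStore`, `DefaultProteinGraph`, `InMemoryStore`) is a hypothetical stand-in.

```python
from abc import ABC, abstractmethod


class ProteinGraph(ABC):
    """Abstract domain model: the only interface callers program against."""

    @abstractmethod
    def add_protein(self, accession: str, name: str) -> None: ...

    @abstractmethod
    def protein_name(self, accession: str) -> str: ...


class GenericGraphStore(ABC):
    """Stand-in for a backend-neutral graph API (the role Blueprints plays)."""

    @abstractmethod
    def put_vertex(self, key: str, properties: dict) -> None: ...

    @abstractmethod
    def get_vertex(self, key: str) -> dict: ...


class DefaultProteinGraph(ProteinGraph):
    """Default implementation written only against the generic graph API."""

    def __init__(self, store: GenericGraphStore) -> None:
        self.store = store

    def add_protein(self, accession, name):
        self.store.put_vertex(accession, {"type": "Protein", "name": name})

    def protein_name(self, accession):
        return self.store.get_vertex(accession)["name"]


class InMemoryStore(GenericGraphStore):
    """Hypothetical concrete backend; a real one would wrap Titan or Neo4j."""

    def __init__(self):
        self._vertices = {}

    def put_vertex(self, key, properties):
        self._vertices[key] = dict(properties)

    def get_vertex(self, key):
        return self._vertices[key]


# Callers depend on ProteinGraph only; swapping the store changes nothing here.
graph: ProteinGraph = DefaultProteinGraph(InMemoryStore())
graph.add_protein("EXAMPLE_1", "Example protein")
print(graph.protein_name("EXAMPLE_1"))
```

The design choice this illustrates is that application code never names a database: replacing `InMemoryStore` with another `GenericGraphStore` leaves both the domain interface and its default implementation untouched.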
Bio4j includes a few different data sources and you may not always be interested in having all of them together. That’s why the importing process is modular and customizable, allowing you to import just the data you are interested in.
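As a rough picture of what a modular import looks like, the Python sketch below registers one importer per data source and runs only the ones requested. It is an illustration under assumptions, not the Bio4j importer: the module names, `make_importer`, `IMPORTERS`, and `run_import` are all hypothetical, and each importer merely records that it ran instead of parsing any files.

```python
def make_importer(source):
    """Build a placeholder importer for one data source."""
    def importer(db):
        # A real importer would parse the source's files; here we only
        # record that this module was imported.
        db.setdefault("imported", []).append(source)
    return importer


# One entry per data source; the keys are illustrative module names.
IMPORTERS = {
    name: make_importer(name)
    for name in ("uniprot", "go", "uniref", "ncbi_taxonomy", "enzyme")
}


def run_import(selected, db=None):
    """Run only the requested importers, in the order given."""
    db = {} if db is None else db
    unknown = [name for name in selected if name not in IMPORTERS]
    if unknown:
        raise ValueError(f"Unknown modules: {unknown}")
    for name in selected:
        IMPORTERS[name](db)
    return db


# Import just Gene Ontology and the NCBI taxonomy, skipping everything else.
database = run_import(["go", "ncbi_taxonomy"])
```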
Bio4j also has a Statika-based module system, which greatly simplifies the process of building and deploying custom releases of Bio4j.
Thanks to the graph structure, data in Bio4j is organized in a way that is semantically equivalent to what it represents. As a result, queries that would be impractical or even impossible to perform with a standard relational database can be feasible with Bio4j, with good performance.
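One kind of query where this matters is a traversal of unknown depth, such as "all proteins belonging to a taxon or any of its descendants". In a relational schema this typically needs a recursive query or one self-join per taxonomy level; on a graph it is a plain traversal. The Python sketch below shows the idea on a toy in-memory graph. The taxon names, protein accessions, and the function `proteins_under` are made up for illustration and are not Bio4j data or API.

```python
from collections import deque

# Hypothetical toy graph: taxon -> child taxa, and taxon -> its proteins.
child_taxa = {
    "root": ["taxon_a", "taxon_b"],
    "taxon_a": ["taxon_a1"],
    "taxon_a1": [],
    "taxon_b": [],
}
proteins_of = {
    "taxon_a": ["EXAMPLE_1"],
    "taxon_a1": ["EXAMPLE_2", "EXAMPLE_3"],
    "taxon_b": ["EXAMPLE_4"],
}


def proteins_under(taxon):
    """Collect proteins of a taxon and all its descendants, breadth-first."""
    found, queue = [], deque([taxon])
    while queue:
        current = queue.popleft()
        found.extend(proteins_of.get(current, []))
        # Follow child edges directly; no depth needs to be known up front.
        queue.extend(child_taxa.get(current, []))
    return found
```

Calling `proteins_under("taxon_a")` walks `taxon_a` and then `taxon_a1`, so the same function works whether the subtree is one level deep or twenty.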
Bio4j is an open source platform released under AGPLv3.