# Bio4j preprint available

A citable preprint describing Bio4j went online on bioRxiv yesterday.

It serves (we hope) as a good introduction to what Bio4j is and what it has to offer, especially if you would rather read prose than code to get a general idea of Bio4j. If you are using Bio4j for something that you want to publish, citing it is much easier now: all bioRxiv preprints are assigned a DOI. Comments, thoughts, and opinions are all more than welcome! We will submit a paper based on this preprint to an open access journal. For completeness, here’s the citation info and the abstract:

### Bio4j: a high-performance cloud-enabled graph-based data platform

Pablo Pareja-Tobes, Raquel Tobes, Marina Manrique, Eduardo Pareja, Eduardo Pareja-Tobes
bioRxiv doi: 10.1101/016758

Background. Next Generation Sequencing and other high-throughput technologies have brought a revolution to the bioinformatics landscape, by offering sheer amounts of data about previously unaccessible domains in a cheap and scalable way. However, fast, reproducible, and cost-effective data analysis at such scale remains elusive. A key need for achieving it is being able to access and query the vast amount of publicly available data, specially so in the case of knowledge-intensive, semantically rich data: incredibly valuable information about proteins and their functions, genes, pathways, or all sort of biological knowledge encoded in ontologies remains scattered, semantically and physically fragmented.

Methods and Results. Guided by this, we have designed and developed Bio4j. It aims to offer a platform for the integration of semantically rich biological data using typed graph models. We have modeled and integrated most publicly available data linked with proteins into a set of interdependent graphs. Data querying is possible through a data model aware Domain Specific Language implemented in Java, letting the user write typed graph traversals over the integrated data. A ready to use cloud-based data distribution, based on the Titan graph database engine is provided; generic data import code can also be used for in-house deployment.

Conclusion. Bio4j represents a unique resource for the current Bioinformatician, providing at once a solution for several key problems: data integration; expressive, high performance data access; and a cost-effective scalable cloud deployment model.

We’ve spent the past few months working really hard on Bio4j. There haven’t been many updates here, basically because there were too many new things happening :)

But now things are stabilizing and it’s about time we start to introduce all the new features and improvements we have in store. In this first post I just want to give an overview of Bio4j’s current state, going into more detail in subsequent posts.

## Bio4j now

### A new graph schema and API

We now have a strongly typed graph schema and traversal API in bio4j/bio4j, based on angulillos (more about angulillos later). With it, you can write traversals over Bio4j data abstractly, and then execute them over any implementation. These queries are checked to be correct both structurally (for example, you cannot ask for the source of a vertex) and with respect to the Bio4j schema. Vertices and edges are now part of graphs, which can declare dependencies; writing your own extensions to the model is now much easier than before. As part of these changes we did a thorough graph-by-graph review of the Bio4j model, which resulted in some significant improvements.

Of course a schema is not that useful without actual data conforming to it; we also wrote generic importers for all graphs. These importers can be executed using any implementation of the angulillos API.

### A Titan-based implementation and data distribution

With much of the work already done at the level of bio4j/bio4j, providing a data distribution of Bio4j becomes pretty simple; you just need to

1. implement angulillos for your database technology of choice; this is what angulillos-titan does for Titan.
2. if your database has support for type definitions and schemas, create those corresponding to the Bio4j schema; this is what we do for each graph in bio4j-titan.

We finished running the importing process for all graphs just a few hours ago. A pretty sizable .tar containing all the Titan files is available from an S3 bucket. With that, you just need to spin up an EC2 instance, download and extract the archive, and start using Bio4j. Or, if you don’t want to use AWS, you can of course run the import process on your own infrastructure.

### Angulillos: generic typed property graphs in Java

Writing correct queries for Bio4j was becoming harder and harder as we integrated more databases and resources, and we had no way of expressing the graph schemas, even for documentation purposes. That is what angulillos strives to solve. You can think of angulillos as a strongly typed version of the property graph model: first you describe a graph schema in terms of types, and then you can write generic traversals over it, which are guaranteed to be well-typed. This means that for example

• you cannot retrieve the outgoing edges of an edge
• and you can get the tweets that a user tweeted, but not the users that a tweet follows!

The API is really straightforward to implement, and its only dependency is Java 8 (for Streams and lambdas). angulillos-titan can serve as an example of how this can be done.
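
To make the Twitter example above concrete, here is a minimal sketch of the general idea of a typed property graph in plain Java. It is not the angulillos API: every name in it (`Vertex`, `Edge`, `User`, `Tweet`, `Tweeted`, `Follows`, `out`) is made up for illustration, and the only assumption is that edge types carry their source and target vertex types as type parameters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch, NOT the angulillos API: all names are illustrative.
class TypedGraphSketch {

    // Marker for vertex types; edges deliberately do not implement it.
    interface Vertex {}

    static final class User implements Vertex {
        final String name;
        User(String name) { this.name = name; }
    }

    static final class Tweet implements Vertex {
        final String text;
        Tweet(String text) { this.text = text; }
    }

    // An edge type fixes its source and target vertex types.
    static abstract class Edge<S extends Vertex, T extends Vertex> {
        final S source;
        final T target;
        Edge(S source, T target) { this.source = source; this.target = target; }
    }

    static final class Tweeted extends Edge<User, Tweet> {
        Tweeted(User user, Tweet tweet) { super(user, tweet); }
    }

    static final class Follows extends Edge<User, User> {
        Follows(User follower, User followed) { super(follower, followed); }
    }

    // Traversal step: targets of the given edges that start at `vertex`.
    // The compiler only accepts a vertex whose type matches the edge's source type.
    static <S extends Vertex, T extends Vertex> List<T> out(
            S vertex, List<? extends Edge<S, T>> edges) {
        return edges.stream()
                .filter(e -> e.source == vertex)
                .map(e -> e.target)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        User alice = new User("alice");
        User bob = new User("bob");
        Tweet hello = new Tweet("hello graphs");
        List<Tweeted> tweeted = Arrays.asList(new Tweeted(alice, hello));
        List<Follows> follows = Arrays.asList(new Follows(bob, alice));

        // Well-typed: the tweets that a user tweeted.
        for (Tweet t : out(alice, tweeted)) {
            System.out.println(alice.name + " tweeted: " + t.text);
        }

        // Neither of these compiles:
        // out(hello, follows);          // a Tweet is not the source of Follows
        // out(tweeted.get(0), follows); // an edge is not a Vertex
    }
}
```

The point of the sketch is that ill-formed traversals are rejected at compile time rather than returning empty results at run time.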

### The future

The next post will be dedicated to a tentative roadmap, explaining what we are working on now: a (really nice) Scala API, data distribution and AWS deployment improvements, and new integrations of genomic data sources are coming in the following months!

# Bio4j goes to GSoC mentor summit 2014

I got home from San Francisco yesterday after attending the 10th edition of the Google Summer of Code mentor summit together with @eparejatobes. It’s been a great experience that I would like to share with you all in this blog post ;) For those of you who still don’t know what GSoC is, here’s a brief description:

Google Summer of Code is a program that offers student developers stipends to write code for various open source projects. We work with many open source, free software, and technology-related groups to identify and fund projects over a three month period.

This was Bio4j’s first year as a GSoC organization and we got three students who worked on the following projects:

It was also my first experience as a mentor, and I must say that I learned a lot and really enjoyed the process.

The events started on Friday with a complimentary visit to the Great America theme park (nice!), followed by a really cool dinner reception at the San Jose Tech Museum of Innovation, where we had surprise speakers such as Linus Torvalds, plus the opportunity to explore the museum’s geeky exhibits while having some drinks.

We were supposed to dress smart for a change, which was interesting, seeing all these people wearing nice clothes :)

I must say that I had to watch around 20 minutes of YouTube videos before I managed to get the tie knot right… xD

Sessions started early the next day with more than eight simultaneous rooms (without taking into account the impromptu sessions that were organized at the ballroom from time to time) and went on till the evening.

It was the first time that I went to an unconference and I just loved it. It is actually great to have the opportunity to explore the different sessions and meet up with people on the way spontaneously, without all the rigidity that so many times comes with “standard” conferences.

Meeting people from the Reactome database project in person was cool, since we plan to include this data source in Bio4j in the near future. It was also nice to see in person some of the guys that I’ve been following on Twitter for a while, like @braincode among others. I also thought it was a good idea to have both the sticker exchange table and the tea room filled with chocolates from all over the world! The day ended with a quiz show that I unfortunately couldn’t join, but I read on Twitter that it was a lot of fun.

On Sunday we opened the day with a trip to the Googleplex, where we could see the actual place where the Google folks work.

There was some time left for a couple more sessions, and then, after the closing session at the hotel, we unfortunately had to say goodbye to all the new acquaintances we had made.

I would like to end this post by thanking all the people who helped organize this awesome summit. Also, special thanks to @fossygirl, great job!

Stay tuned for the next post, we will be releasing a shiny new version of Bio4j based on Titan very soon ;)

@pablopareja

# Bio4j accepted for Google Summer of Code 2014

We are really excited to announce that Bio4j has been accepted as a mentoring organization for Google Summer of Code 2014. This was the first year we applied for it, and it feels just great being part of this initiative!

We think this is a great opportunity for students: a chance to hack on pretty cool stuff around graph databases, bio big data and cloud computing.

## How to participate

If this sounds amazing and you are a student (PhD, masters, undergraduate, whatever) or know someone who is,

1. check our ideas list and then
2. contact a potential mentor or, if you don’t know whom to contact, just @eparejatobes or @pablopareja

# Berkeley Phylogenomics Group receives an NSF grant to develop a graph DB for Big Data challenges in genomics building on Bio4j

The Sjölander Lab at the University of California, Berkeley, has recently been awarded a US$250K EAGER grant from the National Science Foundation to build a graph database for Big Data challenges in genomics. Naturally, they’re building on Bio4j.

The project “EAGER: Towards a self-organizing map and hyper-dimensional information network for the human genome” aims to create a graph database of genome and proteome data for the human genome and related species to allow biologists and computational biologists to mine the information in gene family trees, biological networks and other graph data that cannot be represented effectively in relational databases. For these goals, they will develop on top of the pioneering graph-based bioinformatics platform Bio4j.

“We are excited to see how Bio4j is used by top research groups to build cutting-edge bioinformatics solutions,” said Eduardo Pareja, Era7 Bioinformatics CEO. “To reach an even broader user base, we are pleased to announce that we now provide versions for both Neo4j and Titan graph databases, for which we have developed another layer of abstraction for the domain model using Blueprints.”

“EAGER stands for Early-concept Grants for Exploratory Research,” explained Professor Kimmen Sjölander, head of the Berkeley Phylogenomics Group. “NSF awards these grants to support exploratory work in its early stages on untested, but potentially transformative, research ideas or approaches. My lab’s focus is on machine learning methods for Big Data challenges in biology, particularly for graphical data such as gene trees, networks, pathways and protein structures. The limitations of relational database technologies for graph data, particularly BIG graph data, restrict scientists’ ability to get any real information from that data. When we decided to switch to a graph database, we did a lot of research into the options. When we found out about Bio4j, we knew we’d found our solution. The Bio4j team has made our development tasks so much easier, and we look forward to a long and fruitful collaboration in this open-source project.”

@pablopareja

# Bio4j 0.9: the billion relationships are here!

Hi everyone!

So Bio4j 0.9 has finally made its way out, and it brings you more than 1 billion relationships. These are, approximately, its main numbers:

• 1.216.993.547 relationships
• 190.625.351 nodes
• 584.436.429 properties

A lot of new features and improvements have been incorporated, including the following (I will go into more detail in later posts specifically dedicated to each of them):

## Refurbishing the domain model

We have introduced a new level of abstraction for the domain model by decoupling the inner database implementation from the relationships among the entities themselves. An interface has been developed for each node and relationship present in the database, including methods to access the properties of the entity it represents as well as utility methods that make it easy to navigate to the entities linked to it. All this can be found under the package com.era7.bioinfo.bio4j.model.
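
As a rough illustration of this kind of decoupling, the sketch below defines one interface per entity, with property accessors and a navigation method, plus a toy in-memory backend. The interface and method names (`Protein`, `Organism`, `getAccession`, `getOrganism`, `getScientificName`) are assumptions made for this example, not the actual contents of com.era7.bioinfo.bio4j.model, and the data is placeholder data.

```java
// Hypothetical sketch: interface and method names are illustrative only,
// not the real interfaces from com.era7.bioinfo.bio4j.model.
class DomainModelSketch {

    // One interface per entity type.
    interface Organism {
        String getScientificName();   // property access
    }

    interface Protein {
        String getAccession();        // property access
        Organism getOrganism();       // navigation to a linked entity
    }

    // Toy in-memory backend. A class backed by a graph database could
    // implement the same interfaces without any change to client code.
    static final class InMemoryOrganism implements Organism {
        private final String scientificName;
        InMemoryOrganism(String scientificName) { this.scientificName = scientificName; }
        @Override public String getScientificName() { return scientificName; }
    }

    static final class InMemoryProtein implements Protein {
        private final String accession;
        private final Organism organism;
        InMemoryProtein(String accession, Organism organism) {
            this.accession = accession;
            this.organism = organism;
        }
        @Override public String getAccession() { return accession; }
        @Override public Organism getOrganism() { return organism; }
    }

    // Client code depends only on the interfaces, never on the backend.
    static String describe(Protein protein) {
        return protein.getAccession() + " (" + protein.getOrganism().getScientificName() + ")";
    }

    public static void main(String[] args) {
        // Placeholder values, not real database entries.
        Organism organism = new InMemoryOrganism("Example organism");
        Protein protein = new InMemoryProtein("ACC0001", organism);
        System.out.println(describe(protein));
    }
}
```

Because `describe` only sees the interfaces, swapping the storage technology means writing new implementing classes and leaving the query code as it is.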

## New Blueprints layer

Apart from the set of interfaces we’ve developed another layer for the domain model using Blueprints. This way we’re going one step further for making the domain model independent from the choice of database technology.

## New Titan implementation

After the problems we had with so-called supernodes (which are quite common indeed), we decided to give the Titan graph database a try and see how it behaves in such situations. Both the wrapper classes for each entity and the importing programs have already been implemented. This new prototype still needs some testing, but be sure you’ll be hearing more about it pretty soon! ;)

## Bye bye reference node

We finally decided to stop using the reference node for indexing purposes (actually, there’s no use for it anymore in Bio4j). I have to admit that I was never a fan of it, and it was about time we dropped it. So now auxiliary relationships such as MainTaxonRel or MainDatasetRel have been replaced by a standard node index.

## Bug fixes

This new release comes with many fixes including:

1. EnzymeNode: The node type property was not stored in previous releases.
2. DatasetNode: Name property was not properly indexed.
3. OrganismNode: NCBI tax-id property was not stored in some scenarios.
4. Redundant sequence conflict feature relationships have been fixed.
5. Duplicated submissions have been fixed.
6. The ProteinUnpublishedObservationCitation relationship was missing.
7. The following node types were not properly indexed by their type till now: BookNode, ArticleNode, OnlineArticleNode, SubmissionNode, PatentNode, PublisherNode, OnlineJournalNode, JournalNode

## Java 7

Bio4j uses Java 7 now ;)

Cheers!

@pablopareja

Hi!

Bio4j 0.8 includes a few different data sources and you may not always be interested in having all of them. For example you might be interested in playing around with the Gene Ontology DAG alone and let’s face it, having to import a ~105 GB database to do that wouldn’t make much sense…

That’s why the importing process is modular and customizable, allowing you to import just the data you are interested in. Here’s the big picture of where entities and relationships come from in the general domain model:

There is, however, one thing that you have to keep in mind: you must be consistent when choosing the data sources you want to include in your database. That is to say, you cannot have, for example, the Uniref clusters without previously importing Uniprot KB; otherwise there wouldn’t be any proteins to connect to when importing the clusters!

Here you have a basic schema showing the dependencies among the different modules:

(Let me remind you that two data sources not being connected by an arrow here does NOT mean that they are unrelated; the arrows only show whether a data source can be imported on its own or needs other data sources to be already present in the database.)
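
The consistency rule described above can be checked mechanically before starting an import. The following is a minimal sketch, not part of the Bio4j importers: the class, the module names and the `missingPrerequisites` helper are hypothetical. Only two facts come from this post (the Gene Ontology can be imported alone, and Uniref needs Uniprot KB); treating Uniprot KB as having no prerequisites is an assumption made to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: not the actual Bio4j importer code.
class ImportModulesSketch {

    // Module -> modules that must already be present in the database.
    static final Map<String, List<String>> REQUIRES = new LinkedHashMap<>();
    static {
        // From the post: the Gene Ontology can be imported on its own.
        REQUIRES.put("GeneOntology", Collections.<String>emptyList());
        // Assumption for this sketch only: no prerequisites for Uniprot KB.
        REQUIRES.put("UniprotKB", Collections.<String>emptyList());
        // From the post: Uniref clusters need Uniprot KB proteins to connect to.
        REQUIRES.put("Uniref", Collections.singletonList("UniprotKB"));
    }

    // Returns one message per selected module whose direct prerequisite is
    // missing. If every direct prerequisite is selected, indirect ones are
    // covered too, because they are checked when their own module is visited.
    static List<String> missingPrerequisites(Set<String> selected) {
        List<String> problems = new ArrayList<>();
        for (String module : selected) {
            List<String> required =
                    REQUIRES.getOrDefault(module, Collections.<String>emptyList());
            for (String prerequisite : required) {
                if (!selected.contains(prerequisite)) {
                    problems.add(module + " requires " + prerequisite);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Set<String> onlyClusters = new HashSet<>(Arrays.asList("Uniref"));
        System.out.println(missingPrerequisites(onlyClusters));

        Set<String> consistent = new HashSet<>(Arrays.asList("UniprotKB", "Uniref"));
        System.out.println(missingPrerequisites(consistent));
    }
}
```

An empty result means the selection can be imported as is; any message names a module that has to be added first.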

I’m going to create a wiki page where I will go into more detail regarding database size and importing time depending on your choice of modules. Meanwhile, you can find some more information about how to do this in the Importing Bio4j wiki page.

Have a good day!

@pablopareja

# Bio4j 0.8, some numbers

Hi everyone!

Bio4j 0.8 was recently released and now it’s time to have a deeper look at its numbers (as you can see we are quickly approaching the 1 billion relationships and 100M nodes):

• Number of Relationships: 717.484.649
• Number of Nodes: 92.667.745
• Relationship types: 144
• Node types: 42

Ok, but how are those relationships and nodes distributed among the different types? In this chart you can see the first 20 Relationship types:

Here, the same thing but for the first 20 Node types:

You can also check these two files including the numbers for all the existing types:

All this data was obtained with the program GetNodeAndRelsStatistics.

Have a good weekend!

@pablopareja

# Bio4j 0.8 is here!

Hi everyone!

I’m glad to announce the release of Bio4j 0.8 including more than 5.488.000 new proteins and 3.233.000 genes among others, plus the following improvements and features:

## Pfam families

Bio4j now includes all Pfam families included in Uniprot KB (both Swiss-Prot and TrEMBL). For that, both a new node type and a new relationship type have been created:

• PfamNode

• ProteinPfamRel (this relationship connects a protein and the respective Pfam families associated to it)

The following properties have been added to the Pfam node:

• ID
• Name

Besides, an exact index for the Pfam family ID property has also been created (pfam_id_index).

## NCBI taxonomy tree GI index improved

Old merged node IDs have been incorporated into the Gene Identifier <–> Taxonomy units index. That means that all the GI-TaxID pairs which included an old merged Tax-ID are now also part of the index, resulting in a higher rate of hits when using the index. For that we used the file merged.dmp included in the official tax dump provided by the NCBI.
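
To illustrate the effect, here is a minimal sketch of resolving a GI through old merged Tax-IDs. It is not the Bio4j importer code, and it resolves merges at lookup time rather than storing extra index entries as described above. It assumes the usual layout of the NCBI dump files, where each line of merged.dmp has the form `old_tax_id<TAB>|<TAB>new_tax_id<TAB>|`. The class and method names, and the IDs used in `main`, are made up.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: not the actual Bio4j importer code.
class MergedTaxIdSketch {

    // Parses merged.dmp-style lines ("old\t|\tnew\t|") into an old -> new map.
    static Map<String, String> parseMerged(List<String> lines) {
        Map<String, String> merged = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\\|");
            if (fields.length < 2) {
                continue; // skip blank or malformed lines
            }
            merged.put(fields[0].trim(), fields[1].trim());
        }
        return merged;
    }

    // Finds the taxonomy ID for a GI, following a merge when the ID stored
    // for that GI is an old one. Empty if no taxonomy node can be found.
    static Optional<String> taxIdForGi(String gi,
                                       Map<String, String> giToTaxId,
                                       Map<String, String> merged,
                                       Set<String> taxonomyIds) {
        String taxId = giToTaxId.get(gi);
        if (taxId == null) {
            return Optional.empty();
        }
        String current = merged.getOrDefault(taxId, taxId);
        return taxonomyIds.contains(current) ? Optional.of(current) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up IDs: tax ID 100 was merged into 200, and the GI still points to 100.
        Map<String, String> merged = parseMerged(Arrays.asList("100\t|\t200\t|"));
        Map<String, String> giToTaxId = Collections.singletonMap("gi-1", "100");
        Set<String> taxonomyIds = Collections.singleton("200");

        // With the merge information the lookup is a hit.
        System.out.println(taxIdForGi("gi-1", giToTaxId, merged, taxonomyIds));

        // Without it, the old ID matches no taxonomy node and the lookup misses.
        Map<String, String> noMerges = Collections.emptyMap();
        System.out.println(taxIdForGi("gi-1", giToTaxId, noMerges, taxonomyIds));
    }
}
```

Storing the old IDs in the index itself, as Bio4j does, gives the same hits without the extra map lookup at query time.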

## Bio4j and Bio4jModel projects unification

The Bio4j project has absorbed the Bio4jModel project from this release on.

Until now, the Bio4jModel library included the core classes for the manipulation and traversal of the graph, while the Bio4j project only included the importing programs. I’ve been thinking for a while that this could be confusing and, since there was no real need to keep them as independent projects, I decided to put it all under Bio4j (you just need one jar file now ;) ).

## Bug fixes

1. MetalIonBindingSiteFeature: this feature relationship had an erroneous name assigned, which has now been fixed.

Cheers,

@pablopareja

# New Bio4j general domain model schema available

Hi everyone!

It’s been a few months already since I published the last post, but that doesn’t mean that the development of Bio4j has stopped. On the contrary, I have been working on the integration of Bio4j with other DB-related projects as well as pipelines and tools. Actually, I’m staying in the US for a couple of months right now, working on the implementation and integration of a new database around Bio4j that includes genomic data from grasses, as part of a collaboration with the Ohio State University (I promise to give more details about this and more in upcoming posts).

Ok, but let’s get to the point of this post. Even though a web tool to explore the Bio4j data structure is already available (Bio4jExplorer), I felt that something else was missing in order to get the big picture of all the data included and how it’s interrelated. So I got to work and created this general domain model including all node types and relationships (also specifying their cardinality).

I didn’t include “auxiliary” relationships linked to the reference node, in order not to pollute the schema with relationships that have no semantic meaning and only serve indexing purposes. Also, the text included in both boxes represents different relationships all linking the same nodes, specifically Protein with CommentType and FeatureType. I could have drawn them like the rest, but then I would have ended up with a hairball instead of a meaningful schema.

As always, any feedback is welcome!

@pablopareja