Bio4j 0.8 includes a few different data sources and you may not always be interested in having all of them. For example you might be interested in playing around with the Gene Ontology DAG alone and let’s face it, having to import a ~105 GB database to do that wouldn’t make much sense…
That’s why the importing process is modular and customizable, allowing you to import just the data you are interested in.
Here’s the big picture of where entities and relationships come from in the general domain model:
There’s one thing to keep in mind, however: you must be coherent when choosing the data sources to include in your database. For example, you cannot have the Uniref clusters without previously importing Uniprot KB; otherwise there wouldn’t be any proteins to connect to when importing the clusters!
Here you have a basic schema showing the dependencies among the different modules:
(Let me remind you that two data sources which are not connected by an arrow here are NOT necessarily unrelated; the arrows only indicate whether a module can be imported alone or needs other data sources to be already present in the database.)
I’m going to create a wiki page where I will go into more detail regarding database size and importing time depending on your choice of modules, but meanwhile you can find some more information about how to do this in the Importing Bio4j wiki page.
ProteinPfamRel (this relationship connects a protein with the Pfam families associated with it)
The following properties have been added to the Pfam node:
In addition, an exact index for the Pfam family ID property has also been created (pfam_id_index).
NCBI taxonomy tree GI index improved
Old merged node IDs have been incorporated into the Gene Identifier <–> Taxonomy units index. That means that all the GI-TaxID pairs which included an old merged Tax-ID are now also part of the index, resulting in a higher rate of hits when using the index.
For that we used the file merged.dmp included in the official taxonomy dump provided by the NCBI.
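Conceptually, incorporating the merged IDs boils down to remapping any GI entry whose TaxID appears in merged.dmp. Here is a minimal sketch in Python (not Bio4j's actual importer code; the function names and the dict-based index are just for illustration):

```python
# Illustrative sketch: extend a GI -> TaxID index so that entries pointing
# at old, merged TaxIDs resolve to the current (merged-into) TaxID.

def parse_merged_dmp(lines):
    """Parse merged.dmp rows ('old_taxid\t|\tnew_taxid\t|') into a dict."""
    merged = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        old_id, new_id = fields[0], fields[1]
        merged[old_id] = new_id
    return merged

def extend_gi_index(gi_index, merged):
    """Remap every GI whose TaxID was merged so it hits the current node."""
    for gi, tax_id in list(gi_index.items()):
        if tax_id in merged:
            gi_index[gi] = merged[tax_id]
    return gi_index
```

After this pass, a lookup for a GI that was paired with an obsolete TaxID lands on the taxon node that the NCBI merged it into, instead of missing the index.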
Until now, Bio4jModel library included the core classes for the manipulation and traversal of the graph while Bio4j project only included the importing programs. I’ve been thinking for a while that this could be confusing and, since there was no real need to keep them as independent projects, I decided to put it all under Bio4j (you just need one jar file now ;) ).
New script for the importing process
You don’t have to worry anymore about manually downloading/decompressing/etc… the sources for the DB in case you want to import Bio4j in your own cluster/machine. Just run the script DownloadAndPrepareBio4jSources.sh and it will do it all for you.
MetalIonBindingSiteFeature This feature relationship had an erroneous name assigned and it’s been fixed.
Well, that’s all for now, I’ll be posting more information about this new release soon ;)
It’s been a few months already since I published the last post, but that doesn’t mean that the development of Bio4j stopped; on the contrary, I have been working on the integration of Bio4j with other DB-related projects as well as pipelines and tools. Actually, I’m right now staying in the US for a couple of months, working on the implementation and integration of a new database around Bio4j including grasses genomic data, as part of a collaboration with the Ohio State University (I promise to give more details about this and more in upcoming posts).
Ok, but let’s get to the point of this post. Even though there is already a web tool available to explore Bio4j’s data structure (Bio4jExplorer), I felt that something else was missing in order to get the big picture of all the data included and how it’s interrelated. So I got to work and created this general domain model including all node types and relationships (also specifying their cardinality).
I didn’t include “auxiliary” relationships linked to the reference node, so as not to pollute the schema with relationships that have no semantic meaning and serve only indexing purposes. Also, the text included in both boxes represents different relationships all linking the same nodes -specifically Protein with CommentType and FeatureType. I could have drawn them like the rest, but then I would have ended up with a hairball instead of a meaningful schema.
I’m happy to announce a new set of features for our tool Bio4jExplorer plus some changes in its design. I hope this may help both potential and current users to get a better understanding of Bio4j DB structure and contents.
Node & Relationship properties
You can now check with Bio4jExplorer the properties of any node or relationship in the table situated in the lower part of the interface. Five columns are included:
Name: property name
Type: property type (String, int, float, …)
Indexed: whether the property is indexed or not (yes/no)
Index name: name of the index associated to this property -if there’s any
Index type: type of the index associated to this property -if there’s any
Node & Relationship Data source
You can also see now from which source a Node or Relationship was imported, some examples would be Uniprot, Uniref, GO, RefSeq…
Relationships Name property
With this new version you can directly check here the “internal” name of relationships without having to go to the respective javadoc documentation.
This is quite useful when you are writing your Cypher or Gremlin queries, just check it, copy it, and paste it in your query. An example using the relationship shown in the picture would be this query included in the Bio4j Cypher cheatsheet:
Get proteins (accession and names) associated to an interpro motif (limited to 10 results)
START i=node:interpro_id_index(interpro_id_index = "IPR023306")
MATCH i <-[:PROTEIN_INTERPRO]- p
RETURN p.accession, p.fullname, p.name, p.short_name
In case you are interested in how the tool is implemented, please go to the previous post about Bio4jExplorer, where you can find information about the different code repos and more.
If you want to check the files with the hard-coded information about how nodes, relationships, and indexes are organized (these are the input for the program that creates the AWS SimpleDB domain), I just uploaded them to the bio4j-public S3 bucket. Please click on their names to download them:
There have already been a good few posts showing different uses and applications of Bio4j, but what about Bio4j data itself?
Today I’m going to show you some basic statistics about the different types of nodes and relationships Bio4j is made up of.
Just as a heads up, here are the general numbers of Bio4j 0.7:
Number of Relationships: 530,642,683
Number of Nodes: 76,071,411
Relationship types: 139
Node types: 38
Ok, but how are those relationships and nodes distributed among the different types? In this chart you can see the first 20 Relationship types (click on the image below to check the interactive chart):
Here, the same thing but for the first 20 Node types (click on the image below to check the interactive chart):
You can also check these two files including the numbers from all existing types:
Question: When I checked at PubMed, I did not find Neo4j cited in any of the medical literature. I am not a medical professional but am interested in what might promote Bio4j in the medical research community?
It is too good of a resource to be unnoticed.
I’m glad you liked the post.
It’s true that Bio4j may not yet have caught the attention of many people who could definitely make good use of it. What are the reasons for that? Well, I think it could be a mixture of factors.
Some people don’t much like learning new technologies/strategies/workflows and tend to stick to what they already know as long as possible - which is totally respectable and understandable. Others may simply not have found out about it yet… It’s also possible that, due to the lack of well-structured project documentation, potential users get lost when trying to figure out what Bio4j is about and/or miss the parts they could be interested in.
I could keep on going with more possible reasons that are coming to my mind but still, couldn’t be really objective – it’s me who created this project :D
The point you bring up is actually one of the reasons why we value so much any sort of feedback for the project (especially constructive ‘bad’ feedback that helps us realize its weaknesses).
Let me know if you come up with an idea to let more people know about Bio4j !
I don’t know if you have ever heard of the lowest common ancestor problem in graph theory and computer science, but it’s actually pretty simple. As its name says, it consists of finding, for two different nodes, the common ancestor that has the lowest possible level in the tree/graph.
Even though it is normally defined for just two nodes, it can easily be extended to a set of nodes of arbitrary size. This is a quite common scenario found across multiple fields, and taxonomy is one of them.
The reason I’m talking about all this is that today I ran into the need to use such an algorithm as part of some improvements in our metagenomics MG7 method. After doing some research looking for existing solutions, I concluded that I should implement my own: I couldn’t find any applicable implementation designed for more than just two nodes.
Ok, but let’s get into detail and see my algorithm:
We start from a set of nodes of arbitrary size -4 in this example- which are spread through the taxonomy tree:
We then fetch the first node from the set and calculate its whole ancestor list up to the root of the taxonomy.
Now that we have the list, we take the second node of the set and check whether it’s contained in it; if not, we keep going up through its ancestors until we find a hit. Once the hit has been found, we get rid of the preceding elements in the list (if any) so that they are not taken into account in the next iterations of the algorithm.
We keep going through our node set, and C also removes some elements from the list…
Finally we reach the last node of our set, but no element is removed from our list as a result.
The last thing we have to do is simply get the first element of the resulting list and there we have our lowest common ancestor!
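The steps above can be sketched in a few lines. This is an illustrative Python version over a plain parent-pointer dict, not the actual Java implementation in TaxonomyAlgo:

```python
# Set-LCA as described in the post: build the ancestor path of the first
# node, then for each remaining node climb to the first hit on the path
# and trim everything below it. The path's head is the LCA at the end.

def ancestor_path(parents, node):
    """Path from a node up to the root, inclusive (parents: node -> parent)."""
    path = [node]
    while node in parents:
        node = parents[node]
        path.append(node)
    return path

def lowest_common_ancestor(parents, nodes):
    path = ancestor_path(parents, nodes[0])
    for node in nodes[1:]:
        current = node
        while current not in path:          # climb until we hit the path
            current = parents[current]
        path = path[path.index(current):]   # drop elements below the hit
    return path[0]
```

Each node is walked upwards at most once per iteration, and the candidate path only ever shrinks, which is what makes a single pass over the set enough.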
This algorithm is encapsulated in the class TaxonomyAlgo, specifically in the static method lowestCommonAncestor() that expects a list of NCBITaxonNode as parameter and returns their LCA node.
You can also check the class LowestCommonAncestorTest where a simple test program that makes use of this method is implemented.
This program expects as parameters:
Bio4j DB folder
An arbitrary number of NCBI taxonomy IDs representing the node set
The Scientific name and the NCBI tax ID of the LCA are printed in the console as result.
Oddly enough, I had to solve this exact problem a few years ago (to see how much of a tree is left after an extinction, for calculating the biodiversity impact) and then again just a few weeks ago (but for the unrooted case). Both times I was sure this had to be a solved problem, but there was no obvious solution out there.
I was also quite surprised there wasn’t any ‘official’/obvious solution for this, especially since I’d say it’s quite a common problem.
Now that you mention it, I think I’ll extend the implementation for the unrooted case as well.
By the way, just out of curiosity, what kind of solution did you come up with in the end?
Victor de Jager
interesting post. I solved a very similar problem a few years ago using an early version of the ETE toolkit. http://ete.cgenomics.org/
It’s well documented, with plenty of examples.
Thanks for the link. Just a quick question, is it open-source?
You may be interested in this python script based on the ETE library: https://github.com/jhcepas/ncbi_taxonomy
BTW, ETE is free software
The LCA problem is closely related to the Range Minimum Query problem in graph theory. Working on metagenomics, I had to implement a fast algorithm to search for the LCA of an arbitrary number of leaves in a taxonomic tree. Given that the tree is always the same, you can pre-process it for fast searches. I ended up implementing the sparse table algorithm for RMQ explained here:
You say in your post that you couldn’t find any solution out there for more than 2 nodes. The reason is simple: the LCA of N nodes can be decomposed into N-1 pairwise LCAs (for example, the LCA of 3 nodes is the LCA of one of them and the LCA of the other 2).
Thanks for the link ;)
In my case though I didn’t want to do any pre-processing on purpose. Having everything stored as a graph gives you a great advantage both in terms of speed and scalability and I didn’t want to throw that away. On the other hand this sort of algorithm is one that could be applied to other sub-graphs of Bio4j, not only the taxonomy tree, so once you implement it in this way it’d be trivial to adapt it to other such cases.
I know that the problem can be decomposed so that you end up with a set of 2-node problems. What I meant, however, was that I expected to find algorithms for this problem with some sort of specific optimization for dealing with a big set of nodes, not just two - for example, not passing again through nodes already visited, which happens when you decompose the problem into “isolated” pairs of nodes.
After writing this post in December, I’ve been thinking of doing something similar, yet different, using Neo4j’s Cypher query language.
That’s when I came up with the idea of looking for topological patterns in a large sub-set of the protein-protein interaction network included in Bio4j, rather than focusing on a few proteins selected a priori.
I decided to mine the data in order to find circuits/simple cycles of length 3 where at least one protein is from Swiss-Prot dataset:
I would like to point out that the direction here is important and these two cycles:
A --> B --> C --> A
A --> C --> B --> A
are not the same. Ok, now that this has been said, let’s see what the Cypher query looks like:
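The query itself was shown as a screenshot; a rough reconstruction in Cypher 1.x syntax could look like the following (the index name, relationship types, and property names here are assumptions for illustration, not necessarily the exact identifiers used in Bio4j):

```cypher
// hypothetical sketch of the 3-cycle query; identifiers are assumptions
START d = node:dataset_name_index(dataset_name_index = "Swiss-Prot")
MATCH d <-[:PROTEIN_DATASET]- p1,
      p1 -[:PROTEIN_PROTEIN_INTERACTION]-> p2
         -[:PROTEIN_PROTEIN_INTERACTION]-> p3
         -[:PROTEIN_PROTEIN_INTERACTION]-> p1
RETURN p1.accession, p2.accession, p3.accession
```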
As you can see, it’s really simple and straightforward. In the first two lines we match the proteins from the Swiss-Prot dataset, later retrieving those which form a 3-length cycle as described before. Once the query has finished, you should get something like this:
As you can see, the query took about 17 minutes to complete on a 100% fresh DB -no information was cached at all yet- on an m1.large AWS machine (7.5 GB of RAM).
Not bad, right!?
We have to beware of something, though: this query returns cycles such as:
A --> B --> C --> A
B --> C --> A --> B
as different cycles when they actually are not. That’s why I developed a simple program to remove these repetitions as well as to gather some statistics.
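The de-duplication itself is simple: rotate each cycle so that it starts at its smallest node and keep only one cycle per canonical form. Reversed cycles get a different canonical form, so they stay distinct, as intended. A quick Python sketch (not the actual program):

```python
# Remove rotation duplicates among directed 3-cycles: A->B->C->A and
# B->C->A->B collapse to one entry, while the reversal A->C->B->A remains.

def canonical_cycle(cycle):
    """Rotate the cycle so it starts at its smallest element."""
    i = cycle.index(min(cycle))
    return tuple(cycle[i:] + cycle[:i])

def unique_cycles(cycles):
    seen = set()
    result = []
    for cycle in cycles:
        key = canonical_cycle(cycle)
        if key not in seen:
            seen.add(key)
            result.append(cycle)
    return result
```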
After running the program you get two files:
PPICircuitsLength3NoRepeats file: download it here
All properties found have been incorporated into the enzyme node, including:
Node type indexing
From now on, every node present in the database has a nodeType property containing its type, which has been indexed. That way you can now access all nodes belonging to a specific type really easily.
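For example, a type lookup could look something like this in Cypher (the index name node_type_index and the type value are assumptions for illustration):

```cypher
// hypothetical sketch: count all nodes of one type via the nodeType index
START n = node:node_type_index(node_type = "ProteinNode")
RETURN count(n)
```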
Availability in all Regions
The AWS region you are based in won’t be a problem for using Bio4j anymore. EBS snapshots have been created in all regions, and the CloudFormation templates have been updated so that they can now be used regardless of the region where you want to create the stack.
Only the Asia Pacific (Singapore) ap-southeast-1 region is not ready, due to ongoing issues on AWS’s side regarding extremely slow S3 object downloading. Hope we can find a workaround for this soon!
New CloudFormation templates
Basic Bio4j instance (updated)
The basic Bio4j instance template has been updated so that now you can use it from all zones. Check out more info about this in the updated blog post
Basic Bio4j REST server
A new template has been developed so that you can easily deploy your Neo4j-Bio4j REST server in less than a minute.
This template is available in the following address:
The steps you should follow to create the stack are really simple. Actually, you can use as a guide this blog post about the template I created for deploying a general Neo4j server; only one or two parameters vary.
Bio4j REST server
Once you get your server running thanks to the template I just mentioned, using Neo4j WebAdmin with Bio4j as the data source you will be able to:
Explore your database with the Data browser
Using the data browser tab of the Web administration tool you can explore in real-time the contents of Bio4j!
In order to get visualizations like the one shown above, you should make use of visualization profiles, where you can specify different styles associated with customizable rules expressed in terms of node properties. Here’s a screenshot showing what the visualization profile I used for the visualization above looks like:
Just beware of one thing: the tool does not distinguish between highly connected nodes and more isolated ones. Because of this, clicking nodes such as the Trembl dataset node is not advisable unless you want to see it freeze forever -this node has more than 15 million relationships linking it to proteins.
Run queries with Cypher
Cypher is a declarative language which allows for expressive and efficient querying of the graph store without having to write traversals in code. It focuses on the clarity of expressing what to retrieve from a graph, not how to do it, in contrast to imperative languages like Java and scripting languages like Gremlin.
A query to retrieve protein interaction circuits of length 3 with proteins belonging to Swiss-Prot dataset (limited to 5 results) would look like this in Cypher:
If you want to check out more examples of Bio4j + Cypher, check our Bio4j cypher cheat sheet that we will be updating from time to time.
Querying Bio4j with Gremlin
Gremlins? What do they have to do with Bio4j!?
Gremlin is a graph traversal language that can be used natively in various JVM languages - it currently provides support for Java, Groovy, and Scala. It can express in a few lines of code what would take many, many lines of Java.
Querying proteins associated with the interpro motif with id IPR023306 in Bio4j with Gremlin would look like this (limited to 5 results):
If you want to check out more examples of Bio4j + Gremlin, check our Bio4j gremlin cheat sheet that we will be updating from time to time.
Dataset nodes There was a bug in the importing process which resulted in the creation of a new dataset node every time a new Uniprot entry was stored. Now everything’s fine!
So that’s all for now! I hope you enjoy all these changes and new features I’ve been working on over the last couple of months. As always, feel free to give any feedback you may have, I’m looking forward to it ;)