bio4j

bio data graph db

Bio4j modules, adapt the database to your own needs

Hi!

Bio4j 0.8 includes a few different data sources and you may not always be interested in having all of them. For example you might be interested in playing around with the Gene Ontology DAG alone and let’s face it, having to import a ~105 GB database to do that wouldn’t make much sense…

That’s why the importing process is modular and customizable, allowing you to import just the data you are interested in. Here’s the big picture of where entities and relationships come from in the general domain model:

There is, however, one thing you have to keep in mind: the data sources you choose for your database must be coherent. For example, you cannot have the Uniref clusters without first importing Uniprot KB; otherwise there wouldn’t be any proteins to connect to when importing the clusters!

Here you have a basic schema showing the dependencies among the different modules:

(Let me remind you that two data sources not being connected by an arrow here does NOT mean that they are unrelated. The arrows only show whether a data source can be imported on its own or needs other data sources to be already present in the database.)
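As a minimal sketch of what "being coherent" means in practice, the check below verifies a selection of modules against a dependency map. The module names and the map itself are hypothetical; the only dependency taken from this post is that Uniref needs Uniprot KB, and the Gene Ontology is assumed to be importable alone, as mentioned above. This is not the actual Bio4j importer code.

```python
# Hypothetical module names. Only "uniref needs uniprot_kb" comes from the post;
# a real map would list every module shown in the dependency schema.
DEPENDENCIES = {
    "uniprot_kb": set(),
    "gene_ontology": set(),
    "uniref": {"uniprot_kb"},
}


def missing_dependencies(selected):
    """Return, for each selected module, the required modules that were not selected."""
    chosen = set(selected)
    missing = {}
    for module in chosen:
        unmet = DEPENDENCIES[module] - chosen
        if unmet:
            missing[module] = unmet
    return missing


if __name__ == "__main__":
    # An empty result means the selection is coherent.
    print(missing_dependencies(["gene_ontology"]))
    print(missing_dependencies(["uniref"]))
```

An empty dictionary means the chosen modules can be imported as they are; otherwise the result names what has to be added first.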

I’m going to create a wiki page where I will go into more detail about database size and importing time depending on your choice of modules. Meanwhile, you can find some more information about how to do this in the Importing Bio4j wiki page.

Have a good day!

@pablopareja

Bio4j 0.8, some numbers

Hi everyone!

Bio4j 0.8 was recently released and now it’s time to have a deeper look at its numbers (as you can see we are quickly approaching the 1 billion relationships and 100M nodes):

  • Number of Relationships: 717.484.649
  • Number of Nodes: 92.667.745
  • Relationship types: 144
  • Node types: 42

Ok, but how are those relationships and nodes distributed among the different types? In this chart you can see the first 20 Relationship types:

Here, the same thing but for the first 20 Node types:

You can also check these two files including the numbers for all the existing types:

All this data was obtained with the program GetNodeAndRelsStatistics.

Have a good weekend!

@pablopareja

Bio4j 0.8 is here!

Hi everyone!

I’m glad to announce the release of Bio4j 0.8 including more than 5.488.000 new proteins and 3.233.000 genes among others, plus the following improvements and features:

Pfam families

Bio4j now includes all Pfam families included in Uniprot KB (both Swiss-Prot and TrEMBL). For that, both a new node type and relationship type have been created:

  • PfamNode

  • ProteinPfamRel (this relationship connects a protein and the respective Pfam families associated to it)

The following properties have been added to the Pfam node:

  • ID
  • Name

In addition, an exact index on the Pfam family ID property has been created (pfam_id_index).

NCBI taxonomy tree GI index improved

Old merged node IDs have been incorporated into the Gene Identifier <–> Taxonomy units index. That means that all the GI-TaxID pairs that refer to an old, merged Tax ID are now also part of the index, resulting in a higher hit rate when using the index. For that we used the file merged.dmp included in the official taxonomy dump provided by the NCBI.
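To illustrate the idea, here is a minimal sketch of how old merged IDs can be resolved while building such an index. It assumes that each line of the merged file holds an old tax ID and the ID it was merged into, separated by "|"; the function names and the sample IDs are made up for illustration and this is not the Bio4j importer code.

```python
def parse_merged_dmp(lines):
    """Map each old (merged) tax ID to the tax ID it was merged into."""
    merged = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) >= 2 and fields[0] and fields[1]:
            merged[fields[0]] = fields[1]
    return merged


def build_gi_index(gi_taxid_pairs, current_taxids, merged):
    """Index GI -> tax ID, replacing old merged tax IDs by their current ones."""
    index = {}
    for gi, taxid in gi_taxid_pairs:
        taxid = merged.get(taxid, taxid)  # resolve an old ID if there is one
        if taxid in current_taxids:       # keep only pairs that hit a real node
            index[gi] = taxid
    return index


if __name__ == "__main__":
    # Made-up sample data: tax ID 100 was merged into 200.
    merged = parse_merged_dmp(["100\t|\t200\t|\n"])
    pairs = [("gi1", "100"), ("gi2", "200"), ("gi3", "999")]
    print(build_gi_index(pairs, {"200"}, merged))
```

Without the merged map, the pair for gi1 would be a miss because no taxonomy node carries the old ID anymore; with it, the pair points to the current node.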

Bio4j and Bio4jModel projects unification

From this release on, the Bio4j project absorbs the Bio4jModel project.

Until now, Bio4jModel library included the core classes for the manipulation and traversal of the graph while Bio4j project only included the importing programs. I’ve been thinking for a while that this could be confusing and, since there was no real need to keep them as independent projects, I decided to put it all under Bio4j (you just need one jar file now ;) ).

New script for the importing process

You don’t have to worry anymore about manually downloading/decompressing/etc… the sources for the DB in case you want to import Bio4j in your own cluster/machine. Just run the script DownloadAndPrepareBio4jSources.sh and it will do it all for you.

Bug fixes

  1. MetalIonBindingSiteFeature: this feature relationship had an erroneous name assigned and it’s been fixed.

Well, that’s all for now, I’ll be posting more information about this new release soon ;)

Cheers,

@pablopareja

New Bio4j general domain model schema available

Hi everyone!

It’s been a few months since my last post, but that doesn’t mean that the development of Bio4j has stopped. On the contrary, I have been working on the integration of Bio4j with other DB-related projects as well as pipelines and tools. Actually, I’m staying in the US for a couple of months right now, working on the implementation and integration of a new database around Bio4j that includes genomic data from grasses, as part of a collaboration with the Ohio State University (I promise to give more details about this in upcoming posts).

Ok, but let’s get to the point of this post. Even though a web tool for exploring the Bio4j data structure is already available (Bio4jExplorer), I felt that something else was missing in order to get the big picture of all the data included and how it’s interrelated. So I got to work and created this general domain model including all node types and relationships (also specifying their cardinality).

I didn’t include the “auxiliary” relationships linked to the reference node, so as not to pollute the schema with relationships that have no semantic meaning and only serve indexing purposes. Also, the text included in both boxes represents different relationships that all link the same nodes, specifically Protein with CommentType and FeatureType. I could have drawn them like the rest, but then I would have ended up with a hairball instead of a meaningful schema.

As always, any feedback is welcome!

@pablopareja

Bio4jExplorer, new features and design!

Hello everyone,

I’m happy to announce a new set of features for our tool Bio4jExplorer plus some changes in its design. I hope this may help both potential and current users to get a better understanding of Bio4j DB structure and contents.

Node & Relationship properties

With Bio4jExplorer you can now check the properties of a node or relationship in the table located in the lower part of the interface. Five columns are included:

  • Name: property name
  • Type: property type (String, int, float, String[], …)
  • Indexed: either the property is indexed or not (yes/no)
  • Index name: name of the index associated to this property, if there’s any
  • Index type: type of the index associated to this property, if there’s any

Node & Relationship Data source

You can also see now from which source a Node or Relationship was imported, some examples would be Uniprot, Uniref, GO, RefSeq…

Relationships Name property

With this new version you can directly check here the “internal” name of relationships without having to go to the respective javadoc documentation.

This is quite useful when you are writing your Cypher or Gremlin queries, just check it, copy it, and paste it in your query. An example using the relationship shown in the picture would be this query included in the Bio4j Cypher cheatsheet:

Get proteins (accession and names) associated to an interpro motif (limited to 10 results)

START i=node:interpro_id_index(interpro_id_index = "IPR023306")
 MATCH i <-[:PROTEIN_INTERPRO]- p
 return p.accession, p.fullname, p.name, p.short_name
 limit 10

The url for Bio4jExplorer is the same as before:

In case you are interested in how the tool is implemented, please go to the previous post about Bio4jExplorer, where you can find information about the different code repos and more.

If you want to check the files including the hard-coded information regarding how nodes, relationships, and indexes are organized, and which is the input for the program which creates the AWS SimpleDB domain, I just uploaded them to the bio4j-public S3 bucket. Please click on their names to download them:

I wish you all a great weekend!

I’ll have mine at the beach enjoying our great springy weather with lots of sun down here in Andalucia ;)

@pablopareja

Bio4j 0.7, some numbers

Hi everyone!

There have already been a good few posts showing different uses and applications of Bio4j, but what about Bio4j data itself? Today I’m going to show you some basic statistics about the different types of nodes and relationships Bio4j is made up of. Just as a heads up, here are the general numbers of Bio4j 0.7 :

  • Number of Relationships: 530.642.683
  • Number of Nodes: 76.071.411
  • Relationship types: 139
  • Node types: 38

Ok, but how are those relationships and nodes distributed among the different types? In this chart you can see the first 20 Relationship types (click on the image below to check the interactive chart):

Here, the same thing but for the first 20 Node types (click on the image below to check the interactive chart):

You can also check these two files including the numbers from all existing types:

All this data was obtained with the program GetNodeAndRelsStatistics.

Have a good day!

@pablopareja

comments

  • Patrick Durusau Excellent! Question: When I checked at PubMed, I did not find Neo4j cited in any of the medical literature. I am not a medical professional but am interested in what might promote Bio4j in the medical research community? It is too good of a resource to be unnoticed. Patrick

    • ppareja Hi Patrick, I’m glad you liked the post. It’s true that Bio4j may not have caught the attention of many people yet who could definitely make a good use out of it. What are the reasons for that? Well, I think it could be a mixture of factors. Some people don’t like too much learning new technologies/strategies/workflows… and tend to stick to things they already know as long as possible – which is totally respectable and understandable. Other people though, may simply not have found out about it yet… It’s also possible that due to the lack of a well structured project documentation, potential users get lost in their way when trying to figure out what’s Bio4j about and/or miss the parts they could be interested in. I could keep on going with more possible reasons that are coming to my mind but still, couldn’t be really objective – it’s me who created this project :D The point you bring up is actually one of the reasons why we value so much any sort of feedback for the project, (especially constructive ‘bad’ feedback that helps us realize its weaknesses) Let me know if you come up with an idea to let more people know about Bio4j! Pablo

Bio4j REST Server configures itself now thanks to the updated CF template

Hi all,

I just wanted to write a very short post informing about the changes in the Bio4jBasicRestServerTemplate.

Template what!?

If that’s what you’re thinking, please go here to get an idea of what this is all about.

From now on, this CloudFormation template adapts the server configuration files:

  • neo4j-wrapper.conf
  • neo4j.properties

to the characteristics of the instance type the server is running in, so that it can make the best out of it.

These configurations assume that the server is running alone in the machine.

For that I created these two new mappings in the template:

  • AWSInstanceType2WrapperConfFile
  • AWSInstanceType2Neo4jPropertiesFile

Default configuration values are available in the bio4j-public S3 bucket. For example in order to have access to the server configuration files of a m1.xlarge instance, just go to this url:

same thing for the other file:

If you want to check the conf files for any other instance type, you just have to change the instance type name in the urls linked above.

Have a good weekend!

@pablopareja

Finding the lowest common ancestor of a set of NCBI taxonomy nodes with Bio4j

I don’t know if you have ever heard of the lowest common ancestor problem in graph theory and computer science, but it’s actually pretty simple. As its name says, it consists of finding the common ancestor of two different nodes that lies at the lowest possible level in the tree/graph, that is, the one farthest from the root.

Even though it is normally defined for just two given nodes, it can easily be extended to a set of nodes of arbitrary size. This is quite a common scenario that can be found across multiple fields, and taxonomy is one of them.

The reason I’m talking about all this is that today I ran into the need for such an algorithm as part of some improvements to our metagenomics MG7 method. After doing some research looking for existing solutions, I came to the conclusion that I should implement my own: I couldn’t find any applicable implementation designed for more than just two nodes.

Ok, but let’s get into detail and see my algorithm:

We start from a set of nodes of arbitrary size (4 in this sample), which are spread through the taxonomy tree:

We then fetch the first node from the set and calculate its whole list of ancestors up to the main root of the taxonomy.

Now that we have the list, we take the second node of the set and check whether it’s contained in it. If not, we keep going up through its ancestors until we find a hit. Once the hit has been found, we get rid of the elements that precede it in the list (if any) so that they are not taken into account in the next iterations of the algorithm.

We keep going through our node set, and C also removes some elements from the list…

Finally we reach the last node of our set, but no element is removed from our list as a result.

The last thing we have to do is simply get the first element of the resulting list and there we have our lowest common ancestor!
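The steps above can be sketched in a few lines. This is only an illustration of the idea on a toy tree stored as a child-to-parent dictionary, not the Bio4j implementation: the function name, the dictionary representation and the sample node names are all assumptions, and the sketch expects every node to belong to the same rooted tree.

```python
def lowest_common_ancestor(nodes, parent):
    """Return the LCA of a list of nodes.

    `parent` maps every node to its parent; the root maps to None.
    """
    if not nodes:
        raise ValueError("at least one node is required")

    # Ancestor list of the first node, from the node itself up to the root.
    lineage = []
    current = nodes[0]
    while current is not None:
        lineage.append(current)
        current = parent[current]

    for node in nodes[1:]:
        # Climb from this node until we hit an element still in the list.
        current = node
        while current not in lineage:
            current = parent[current]
        # Drop the elements that precede the hit; they cannot be common ancestors.
        lineage = lineage[lineage.index(current):]

    # The first element left is the lowest common ancestor.
    return lineage[0]


if __name__ == "__main__":
    # Toy tree: root -> a, b; a -> c, d; c -> e; b -> f
    parent = {"root": None, "a": "root", "b": "root",
              "c": "a", "d": "a", "e": "c", "f": "b"}
    print(lowest_common_ancestor(["e", "d"], parent))
    print(lowest_common_ancestor(["e", "d", "f"], parent))
```

Since the root is never removed from the list, the climb always finds a hit as long as all nodes share that root.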

This algorithm is encapsulated in the class TaxonomyAlgo, specifically in the static method lowestCommonAncestor() that expects a list of NCBITaxonNode as parameter and returns their LCA node.

You can also check the class LowestCommonAncestorTest where a simple test program that makes use of this method is implemented.

This program expects as parameters:

  1. Bio4j DB folder
  2. An arbitrary number of NCBI taxonomy IDs representing the node set

The scientific name and the NCBI tax ID of the LCA are printed to the console as a result.

Enjoy!

@pablopareja

comments

  • Paul Agapow Oddly enough, I had to solve this exact problem a few years ago (to see how much of a tree is left after an extinction, for calculating the biodiversity impact) and then just a few weeks ago (but for the unrooted case). Both times I was sure this had to be a solved problem, but there were no obvious solution out there.

    • Pablo Pareja Hi Paul, I was also quite surprised there wasn’t any ‘official’/obvious solution for this, specially when I’d say it’s quite a common problem. Now that you mention it, I think I’ll extend the implementation for the unrooted case as well. By the way, just out of curiosity, what kind of solution did you come up with in the end?
  • Victor de Jager Hi Pablo, interesting post. I solved a very similar problem a few years ago using an early version of the ETE toolkit. http://ete.cgenomics.org/ It’s a well documented with plenty of examples.

    • ppareja Hi Victor, Thanks for the link. Just a quick question, is it open-source?
  • Jaime Hi, You may be interested in this python script based on the ETE library: https://github.com/jhcepas/ncbi_taxonomy BTW, ETE is free software

  • Miguel The LCA problem is closely related to the Range Minimum Query problem in graph theory. Working on metagenomics I had to implement a fast algorithm to search for LCA of an arbitrary number of leafs in a taxonomic tree. Given that the tree is always the same, you can pre-process it for fast searches. I ended up implemented the Sparse table algorithm for RMQ explained here: http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=lowestCommonAncestor You say in your post that you couldn’t find any solution out there for more than 2 nodes. The reason is simple: the LCA of N nodes can be decomposed to N-1 times the LCAs of 2 nodes (for example, the LCA of 3 nodes is the LCA of one of them and the LCA of the other 2).

    • ppareja Hi Miguel, Thanks for the link ;) In my case though I didn’t want to do any pre-processing on purpose. Having everything stored as a graph gives you a great advantage both in terms of speed and scalability and I didn’t want to throw that away. On the other hand this sort of algorithm is one that could be applied to other sub-graphs of Bio4j, not only the taxonomy tree, so once you implement it in this way it’d be trivial to adapt it to other such cases. I know that the problem can be decomposed so that you end up with a set of 2-nodes problems, what I meant however was that I expected to find algorithms for this problem with some sort of specific optimizations when dealing with a big set of nodes, not only two. For example somehow not passing again through nodes already visited, which will happen when you do decomposing the problem in “isolated” pairs of nodes.

Mining Bio4j data: finding topological patterns in PPI networks

Hi everyone!

After writing this post in December, I’ve been thinking of doing something similar, yet different, using the Neo4j Cypher query language.

That’s how I came up with the idea of looking for topological patterns across a large subset of the protein-protein interaction network included in Bio4j, rather than focusing on a few proteins selected a priori.

I decided to mine the data in order to find circuits/simple cycles of length 3 where at least one protein is from Swiss-Prot dataset:

I would like to point out that the direction here is important and these two cycles:

  • A --> B --> C --> A
  • A --> C --> B --> A

are not the same. Ok, so now that this has been said, let’s see what the Cypher query looks like:

START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
MATCH d <-[r:PROTEIN_DATASET]- p, 
circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
 return p.accession, p2.accession, p3.accession

As you can see, it’s really simple and straightforward. In the first two lines we match the proteins from the Swiss-Prot dataset, and then we retrieve the ones that form a cycle of length 3 as described before. Once the query has finished, you should get something like this:

cypher> 
==> +---------------------------------------------------------+
p.accession | p2.accession | p3.accession | 
==> +---------------------------------------------------------+
Q08465 P35189 P3421
Q08465 P34218 P35189
Q8GXA4 Q8L7E5 Q9LE82
Q8GXA4 Q9FH18 Q8L7E5
....
==> +---------------------------------------------------------+
==> 6632 rows, 1019211 ms

As you can see, the query took about 17 minutes to complete on a 100% fresh DB (no information was cached at all yet) running on an m1.large AWS machine, which has 7.5 GB of RAM.

Not bad, right!?

We have to beware of something though, this query returns cycles such as:

  • A --> B --> C --> A
  • B --> C --> A --> B

as different cycles when they are actually not. That’s why I developed a simple program to remove these repetitions as well as for fetching some statistics information. After running the program you get two files:

  1. PPICircuitsLength3NoRepeats file: download it here
  2. PPICircuitsProteinsFreq file: download it here.

After performing the filtering, the circuits found were reduced to 2226 records.
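The filtering idea can be sketched as follows: every directed cycle is rotated so that its smallest accession comes first, which makes all rotations of the same cycle identical while keeping the direction intact. This is just a minimal illustration with made-up function names and placeholder accessions, not the program mentioned above.

```python
from collections import Counter


def canonical_rotation(cycle):
    """Rotate a directed cycle so that its smallest element comes first."""
    start = cycle.index(min(cycle))
    return tuple(cycle[start:] + cycle[:start])


def unique_cycles(rows):
    """Remove rows that are rotations of a cycle already seen."""
    seen = set()
    result = []
    for row in rows:
        key = canonical_rotation(list(row))
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


def protein_frequencies(cycles):
    """Count in how many cycles each protein takes part."""
    return Counter(protein for cycle in cycles for protein in cycle)


if __name__ == "__main__":
    rows = [("A", "B", "C"), ("B", "C", "A"), ("A", "C", "B")]
    cycles = unique_cycles(rows)
    print(cycles)
    print(protein_frequencies(cycles))
```

Note that A --> B --> C --> A and A --> C --> B --> A are still kept as two different cycles, since a rotation never changes the direction.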

Finally, I also created a really simple chart including the absolute frequency of the first 20 proteins with more occurrences in the cycles that were found.

Well, that’s all for now. Have a good day!

@pablopareja

Bio4j release 0.7 is out!

Hi!

I’m happy to announce that the version 0.7 of Bio4j has been released. Check out the wide set of new features, tools and improvements:

Expasy Enzyme database integration

From now on you have the whole Enzyme DB included in Bio4j. For that, both a new node type and relationship type have been created:

All the properties found have been incorporated into the enzyme node, including:

  • ID
  • Official name
  • Alternate names
  • Cofactors
  • Comments
  • Catalytic activity
  • Prosite cross-references

Node type indexing

From now on, every node present in the database has an indexed property nodeType holding its type. That way you can now access all nodes belonging to a specific type really easily.

Availability in all Regions

The AWS region you are based in won’t be a problem for using Bio4j anymore. EBS snapshots have been created in all regions, and the CloudFormation templates have been updated so that they can now be used regardless of the region where you want to create the stack.

Only the Asia Pacific (Singapore) ap-southeast-1 region is not ready, due to ongoing issues on the AWS side that make S3 object downloads extremely slow. I hope we can find a workaround for this soon!

New CloudFormation templates

Basic Bio4j instance (updated)

The basic Bio4j instance template has been updated so that now you can use it from all zones. Check out more info about this in the updated blog post

Basic Bio4j REST server

A new template has been developed so that you can easily deploy your Neo4j-Bio4j REST server in less than a minute.

This template is available in the following address:

The steps you should follow to create the stack are really simple. Actually, you can use this blog post about the template I created for deploying a general Neo4j server as a guide; only one or two parameters vary.

Bio4j REST server

Once you get your server running thanks to the useful template I just mentioned before, using Neo4j WebAdmin with Bio4j as source you will be able to:

Explore your database with the Data browser

Using the data browser tab of the Web administration tool you can explore in real-time the contents of Bio4j!

In order to get visualizations like the one shown above, you should make use of visualization profiles. There you can specify different styles associated to customizable rules which can be expressed in terms of the node properties. Here’s a screenshot showing what the visualization profile I used for the visualization above looks like:

Just beware of one thing: the tool does not distinguish between highly connected nodes and more isolated ones. Because of this, clicking on nodes such as the Trembl dataset node is not advisable unless you want to see it freeze forever (this node has more than 15 million relationships linking it to proteins).

Run queries with Cypher

Cypher what?!

Cypher is a declarative language which allows for expressive and efficient querying of the graph store without having to write traversers in code. It focuses on the clarity of expressing what to retrieve from a graph, not how to do it, in contrast to imperative languages like Java, and scripting languages like Gremlin.

A query to retrieve protein interaction circuits of length 3 with proteins belonging to Swiss-Prot dataset (limited to 5 results) would look like this in Cypher:

START d=node:dataset_name_index(dataset_name_index = "Swiss-Prot")
 MATCH d <-[r:PROTEIN_DATASET]- p, 
 circuit = (p) -[:PROTEIN_PROTEIN_INTERACTION]-> (p2) -[:PROTEIN_PROTEIN_INTERACTION]-> (p3) -[:PROTEIN_PROTEIN_INTERACTION]-> (p)
 return p.accession, p2.accession, p3.accession, p.accession
 limit 5

If you want to check out more examples of Bio4j + Cypher, check our Bio4j cypher cheat sheet that we will be updating from time to time.

Querying Bio4j with Gremlin

Gremlins? What do they have to do with Bio4j!?

Gremlin is a graph traversal language that can be natively used in various JVM languages - it currently provides native support for Java, Groovy, and Scala. However, it can express in a few lines of code what it would take many, many lines of code in Java to express.

Querying proteins associated to the interpro motif with id IPR023306 in Bio4j with Gremlin would look like this: (limited to 5 results)

gremlin> g.idx('interpro_id_index')[['interpro_id_index':'IPR023306']].inE('PROTEIN_INTERPRO').outV.accession[0..4]
==> E2GK26
==> G3PMS4
==> G3Q865
==> G3PIL8
==> G3NNA4
gremlin> 

If you want to check out more examples of Bio4j + Gremlin, check our Bio4j gremlin cheat sheet that we will be updating from time to time.

Bug fixes

  1. Dataset nodes: there was a bug in the importing process which resulted in the creation of a new dataset node every time a new Uniprot entry was stored. Now everything’s fine!

So that’s all for now! I hope you enjoy all these changes and new features I’ve been working on over the last couple of months. As always, feel free to give any feedback you may have, I’m looking forward to it ;)

@pablopareja