bio4j

bio data graph db

Using Bio4j + Neo4j Graph-algo component for finding protein-protein interaction paths

Hi all!

Today I managed to find some time to check out the Graph-algo component from Neo4j and, after playing with it plus Bio4j a bit, I have to say it seems pretty cool. For those who don’t know what I’m talking about, here is the description you can find in the Neo4j wiki:

This is a component that offers implementations of common graph algorithms on top of Neo4j. It is mostly focused around finding paths, like finding the shortest path between two nodes, but it also contains a few different centrality measures, like betweenness centrality for nodes.

The algorithm for finding the shortest path between two nodes caught my attention and I started to wonder how I could try it out on the data included in Bio4j. I then realized that protein-protein interactions could be a good candidate, so I got down to work and created the utility method:

  • findShortestInteractionPath(ProteinNode proteinSource, ProteinNode proteinTarget, int maxDepth, int maxResultsNumber)

for getting at most maxResultsNumber paths between proteinSource and proteinTarget with a maximum path depth of maxDepth. You can check the source code here
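
To make the two limits concrete, here is a minimal Python sketch of the idea. It is not the Bio4j or Neo4j implementation: it runs a breadth-first search over a plain dictionary of interactions and returns at most max_results shortest paths of at most max_depth hops. The function name, the dictionary layout and the protein ids (P1 to P5) are assumptions made up for illustration.

```python
from collections import deque


def find_shortest_interaction_paths(interactions, source, target,
                                    max_depth, max_results):
    """Return up to max_results shortest paths (lists of protein ids)
    from source to target that use at most max_depth interaction hops."""
    if source == target:
        return [[source]]

    # Breadth-first search that remembers every shortest-path predecessor.
    dist = {source: 0}
    preds = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # Do not expand past the depth limit or beyond the target itself.
        if dist[node] >= max_depth or node == target:
            continue
        for neighbour in interactions.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                preds[neighbour] = [node]
                queue.append(neighbour)
            elif dist[neighbour] == dist[node] + 1:
                preds[neighbour].append(node)

    if target not in dist:
        return []

    # Walk the predecessors back from the target to rebuild each path.
    paths = []
    stack = [[target]]
    while stack and len(paths) < max_results:
        partial = stack.pop()
        if partial[0] == source:
            paths.append(partial)
            continue
        for pred in preds[partial[0]]:
            stack.append([pred] + partial)
    return paths


# Hypothetical interaction network (undirected, so listed in both directions).
interactions = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P4"],
    "P3": ["P1", "P4"],
    "P4": ["P2", "P3", "P5"],
    "P5": ["P4"],
}

for path in find_shortest_interaction_paths(interactions, "P1", "P4", 3, 10):
    print(" -> ".join(path))
```

In this toy network there are two equally short routes from P1 to P4 (through P2 or through P3); lowering max_results to 1 keeps only one of them, and lowering max_depth to 1 yields no path at all.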

I also wrote a small test program that prints out the paths found between two proteins.

Even though I missed having a wider choice of algorithms, it’s really cool to have at least this small set already implemented, sparing you the low-level coding. Apart from that, I’ve been thinking about how Bio4j could open a lot of doors for topology/network analysis across all the data it includes. Such analysis could otherwise be quite hard to perform for several reasons, such as the lack of data integration between different data sources and storage paradigms that limit topology/network analysis, among others…

With Bio4j however, you just have to move around the nodes and get the information you’re looking for! ;)

@pablopareja

comments

  • alper yilmaz it’s getting more interesting.. :) that’s what I meant by “data-mining” during our skype conference.. cool..

  • Roji I follow neo4j with much interest. It is a novel approach, however I think property searches are very important and neo4j is not very good at this. So for example, implementing a complete social website with millions of users would not be feasible with neo4j, I think. I am not sure, of course. What is also interesting is the rise of native XML databases. They also solve the impedance mismatch to a certain extent. However their models are trees, not graphs; graphs are more general in this sense, but I think more optimizations are possible if you choose trees.

    • ppareja Hi Roji, Could you provide some reasons why you think property searches are not good with Neo4j? Regarding XML databases and other tree-oriented options, they definitely are great for many use cases, however when you have to deal with highly connected data they may not be enough. The case depicted in this blog post is a good example, even just modelling protein-protein interactions would not be possible with a tree – you get plenty of cycles which cannot be expressed with that paradigm…

Bio4j + AWS CloudFormation = your own fresh baked DB in less than a minute!

UPDATE: You can now use this template from **all regions but ap-southeast-1!**

Hi!

So this week it was time to finally start using CloudFormation together with Bio4j. For those not familiar with this AWS service, quoting from their site:

AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.

This is really useful because, thanks to CloudFormation templates, you don’t have to worry about manually launching an instance, creating a volume, attaching it, doing some stuff, and then freeing the resources… You can encapsulate all these tasks in a template, reducing them to just two:

  1. **create** the stack
  2. **delete** the stack whenever you are done with it

This template is available at the following address:

So, let’s see how easy it actually is to create your stack. First you should go to the CloudFormation tab in the amazon console and click the button: Create New Stack:

You will now see a new window where you should choose the option ‘Provide a template URL’ and paste the template URL there. You should also give your stack a name by filling in the field Stack name. Then click Continue.

Ok, now you should be seeing this:

Then provide your key-pair name, availability zone, and finally the type of instance you want to launch. Once you click Continue you’ll see a review of all the parameters you entered so far, like:

Check everything is as you wish and click continue. You should be seeing then something like this:

Now you just have to wait for about 30 seconds and refresh until the stack state turns green and says “CREATE_COMPLETE”. Click on the output tab and you will see the IP address you need to connect to your new instance with SSH.

So now you just have to connect to your instance and you should have your freshly baked Bio4j DB under the folder /mnt/bio4j_volume/bio4jdb ;)

Whenever you are done, just select delete stack in the amazon console and don’t worry about terminating your instance or deleting your volume: CloudFormation will do it for you!

The only thing you have to do yourself is unmount (umount) the volumes you have attached; it seems that CloudFormation cannot do that for you right now…

@pablopareja

Cool GO annotation visualizations with Gephi + Bio4j

Hi everyone!

After a few months without finding the opportunity to play with Gephi, it was already time to dedicate a lab day to this. I thought that a good feature would be having the equivalent .gexf file for the current graph representation available at the tab “GoAnnotation Graph Viz”; so that you could play around with it in Gephi adapting it to your specific needs. Then I got down to work and this is what happened:

First of all, I was really happy to see that there was a new version of Gephi (0.8) as well as a good bunch of new (at least for me… :D) layout algorithm plugins available, like Parallel Force Atlas, Circular Layout or Layered Layout. So once I had downloaded and installed everything, I started to have some fun with it and to get to know how filters work (I hadn’t used them before). Even though I got stuck a couple of times trying to figure out how to use some of them, I easily solved these small setbacks thanks to the great support found in the Gephi forums, where they quickly answered my newbie questions. Thanks, Gephi team!

As a source for the graph I used the public EHEC GO annotations we did for the E. coli O104:H4 Genome Analysis Crowdsourcing we coordinated last summer and chose the Molecular Function sub-ontology for the visualization.

When I first loaded the gexf file in Gephi without applying any kind of filters this is what I got:

As you (maybe) can see, the size of GO term nodes is proportional to the number of proteins they annotate; still it pretty much looks just like a big hair-ball…

Then I applied the following set of filters:

in order to get the GO terms with at least 6 protein annotations plus the proteins which are annotating these terms (their neighborhoods); and this is what it looked like (after applying a Parallel Force Atlas layout algorithm):
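
For readers without Gephi at hand, the filter described above can be approximated in a few lines of Python. This is only a sketch of the logic, not what Gephi does internally: it assumes the annotations are available as a dictionary from GO term id to the set of proteins annotating it, and the term and protein ids below are made up.

```python
def filter_go_annotations(annotations, min_proteins=6):
    """Keep the GO terms annotated by at least min_proteins proteins,
    plus the proteins attached to those terms (their neighborhoods).

    annotations maps a GO term id to an iterable of protein accessions.
    Returns (kept_terms, kept_proteins)."""
    kept_terms = {}
    for term, proteins in annotations.items():
        unique_proteins = set(proteins)
        if len(unique_proteins) >= min_proteins:
            kept_terms[term] = unique_proteins
    # Proteins survive only if at least one of their terms survived.
    kept_proteins = set().union(*kept_terms.values())
    return kept_terms, kept_proteins


# Hypothetical annotation data: one well-populated term and two small ones.
annotations = {
    "term_a": {"prot1", "prot2", "prot3", "prot4", "prot5", "prot6"},
    "term_b": {"prot1", "prot7"},
    "term_c": {"prot8"},
}

terms, proteins = filter_go_annotations(annotations, min_proteins=6)
# Node size proportional to the number of annotating proteins, as in the graph.
node_sizes = {term: len(members) for term, members in terms.items()}
print(sorted(terms), sorted(proteins), node_sizes)
```

Only term_a passes the threshold here, so prot7 and prot8 disappear from the filtered graph while prot1 stays because it also annotates term_a.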

I then decided to get rid of the protein labels, since there were far too many of them and they were not very useful to see; for that I used the option: “Hide nodes/edges labels if not in filtered graph”. After doing this and applying the black background preview setting, the visualization finally looked pretty decent:

Please go here to check the version exported with Sea Dragon plugin where you can zoom and move around!

Well, if you like the result (or you don’t but you want to play with this and get a better viz!), I just uploaded a new version of Bio4j GO Tools viewer where you can download the corresponding .gexf file for your GO annotations XML file. Just press the button highlighted in the screenshot and enter the URL for your GO annotations XML file:

You can use the public EHEC GO annotation results URL I used as a sample for this post:

  • https://s3-eu-west-1.amazonaws.com/pablo-tests/EHECAnnotationVersion2.xml

So, that’s all for now, please let me know if you play around with this and get some cool visualizations!

@pablopareja

comments

  • Amrit Good to know. Does it take expression data also? I have expression data with gene name and probe ID only. Would you mind suggesting whether it would work or not for this kind of data? Thank you so much for your help.

    • Pablo Pareja Hi Amrit, There is no restriction on the input data; the only thing is that the tool expects Uniprot accessions as parameters. You would just need to map your gene names to Uniprot accessions using an ID mapping tool such as the one available on the Uniprot website: http://www.uniprot.org/ (ID mapping tab) Cheers, Pablo

Bio4j Thanksgiving treats!

Hi all!

Thanksgiving is almost here and, just in time, we have a lot of special treats for you:

New github organization

All bio4j related repositories are now under the organization bio4j in github.

New wiki(s)

The old wiki hosted at wiki.bio4j.com has been moved to the corresponding Bio4j repository wiki. More information has been added and the previous content has been restructured. Besides that, new wikis are being created for each of the bio4j-related tool repositories.

NCBI taxonomy

We’re happy to announce the official incorporation of NCBI taxonomy data into Bio4j DB, as well as an index for retrieving NCBI taxons from GenInfo identifiers (GI), so there’s no longer any need to parse that huge gi_taxid_nucl NCBI table to achieve that. There are no changes to Uniprot taxonomy, but you can now navigate to the equivalent NCBI taxon using the relationship NCBITaxonRel.

Reactome terms

We’ve included the Reactome term references found in Uniprot files, so from now on you can retrieve both all terms associated with a specific protein and all proteins associated with a specific term.

New EBS snapshot for this release

For those using AWS (or willing to…) there’s a new public EBS snapshot containing the latest version of Bio4j DB. The snapshot details are the following:

  • Snapshot id: snap-aa5cd3c2
  • Snapshot region: EU West (Ireland)
  • Snapshot size: 100 GB

Bio4j DB is under the folder bio4jdb. In order to use it, just create a Bio4jManager instance and start navigating the graph!

UP 2011 Bio4j presentation

We’re really pleased to announce our presence at this year’s UP 2011 Cloud Computing Conference. The presentation will be held on day 4, Thursday, December 8, 2011. Check the agenda for the conference here.

Enjoy!

and happy Thanksgiving! ;)

@pablopareja

Bio4jExplorer: familiarize yourself with Bio4j nodes and relationships

Hi!

I just uploaded a new tool intended to serve both as a reference manual and as a first contact with the Bio4j domain model: Bio4jExplorer. Bio4jExplorer allows you to:

  • Navigate through all nodes and relationships
  • Access the javadocs of any node or relationship
  • Graphically explore the neighborhood of a node/relationship
  • Look up the different indexes that may serve as an entry point for a node
  • Check incoming/outgoing relationships of a specific node
  • Check start/end nodes of a specific relationship

Both nodes and relationships in the graph visualization are clickable and lead to their respective record. Besides, you can choose between two different layout algorithms: Level layout and Circular layout (nodes are also draggable so that you can configure the layout as you wish).

For those interested in how this was done: on the server side I created an AWS SimpleDB database holding all the information about the Bio4j model, i.e. everything regarding nodes, relationships, indexes… (here you can check the program used for creating this database using the Java AWS SDK). Meanwhile, on the client side I used the Flare prefuse AS3 library for the graph visualization. As always with everything we do at Oh no sequences!, everything that is part of this tool is open source. You can check the different code repositories at the following addresses:

All kinds of feedback/suggestions are welcome ;)

Bio4j incorporates git-flow model

Hi all! So summer has almost come to an end and I now have time again to continue with the development of the Bio4j project. This autumn comes with plenty of new features that will be released in the next few months; the first of them is the git-flow model. Bio4j is moving forward fast and it was about time to organize its development and support it with an adequate model. This is where git-flow comes in, providing a simple yet powerful development model where releases, features, hot-fixes… can be managed without going crazy putting them all together. Here you have a general schema of the model:

Since @nvie wrote a really good post explaining the details of his model, I’ll just provide this link to it instead of giving a much poorer explanation than the one in the article.

Bio4j current release now available as an AWS snapshot

For those using AWS (or willing to…) I just created a public snapshot containing the latest version of Bio4j DB. The snapshot details are the following:

  • Snapshot id: snap-25192d4c
  • Snapshot region: EU West (Ireland)
  • Snapshot size: 90 GB

The whole DB is under the folder bio4jdb. In order to use it, just create a Bio4jManager instance and start navigating the graph!

As always, any feedback/comment/question is more than welcome, (post a comment here or a question in the user group).

Improvements in Bio4j Go Tools (Graph visualization)

Hi everyone!

A new version of the Bio4j Go Tools viewer is available. It includes improvements in the graph visualization of GO annotation results. These are the new features:

  • Load GO annotation results from URL: There’s no longer any need to upload the XML file with the results every time you want to see the graph visualization. Just enter the publicly accessible URL of the file and the server will fetch the file for you directly.
  • **Restrict** the visualization to only **one GO sub-ontology** at a time: Terms belonging to different sub-ontologies (cellular component, biological process, molecular function) are not mixed up anymore.
  • Choice of layout algorithms: You can choose between two different layout algorithms for the visualization, (Yifan Hu and Fruchterman Reingold).
  • Customizable layout algorithm time: Range of 1-10 minutes.

I also made a short tutorial showing most of the features available in the following real-world use case: GO annotation results for Era7 E. coli TY-2482 annotation with BG7 system of BGI V2 assembly

The corresponding GO Annotation results XML file is publicly available here. Just click the button ‘load file from url’ and paste the address of the file.

For those new to Bio4j Go Tools, two external open-source projects are used apart from Bio4j itself:

That’s all for now, keep an eye on the blog/twitter for updates ;)

Bio4j includes RefSeq data now!

Hi all,

After some weeks of hard work I finally finished the importer for RefSeq data. First of all, I should clarify some points about its licensing:

  • Data has been retrieved from the public ftp site for RefSeq complete release. There is no extra/different data coming from other source.
  • Quoting NCBI site: “NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted.”

With that said, I will go into more detail about how it’s been done.

Genome elements’ sequences

Sequences are not stored in Bio4j DB but uploaded as separate files to S3 (Amazon Simple Storage Service) instead. Why do it this way? For several reasons:

  • Having all sequences stored in the DB would take more than a decent amount of space
  • Most queries to the DB wouldn’t be done in terms of the sequence content
  • The RefSeq data that is relevant for queries is the information about genes, RNAs, genome elements, the positions of all these elements, etc. (rather than the sequence itself).
  • Sequences are stored in txt files whose filename is the unique version string for the specific genome element (e.g. NC_012932.txt). That way they can easily be retrieved whenever needed. Plus, the S3 service provides a way of extracting a range of bytes from a file without downloading the whole content, so there’s no need to download the complete sequence when you already know the range of the sequence you are interested in.
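
As a rough illustration of the byte-range idea, the Python sketch below turns 1-based sequence coordinates into an HTTP Range header, which is the standard way to ask a server for part of a file. It assumes each file holds only the raw sequence, one byte per base, with no header line or line breaks; the function names are hypothetical and the URL is a placeholder, not the real bucket address. The request is only built here, never sent.

```python
import urllib.request


def range_header(start, end):
    """Build an HTTP Range header for the 1-based, inclusive sequence
    interval [start, end]. HTTP byte ranges are 0-based and inclusive,
    so both coordinates shift down by one (assuming one byte per base)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return {"Range": "bytes={}-{}".format(start - 1, end - 1)}


def build_subsequence_request(url, start, end):
    """Prepare (but do not send) a GET request for part of a sequence file."""
    return urllib.request.Request(url, headers=range_header(start, end))


# Placeholder URL: the real location of the sequence files is not given here.
request = build_subsequence_request(
    "https://example.com/NC_012932.txt", start=100, end=200)
print(request.get_header("Range"))
```

Passing the prepared request to urllib.request.urlopen would then download only the requested 101 bytes, provided the server honours range requests and the file layout matches the assumption above.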

In some cases, no sequence can be uploaded for a genome element. These are the cases where, instead of a final sequence, a list of terms such as join(x…x)complement(x…x(contig(joing(…)) is provided (I never thought I’d find hundreds of lines with these terms where a sequence was supposed to be…).

Genome elements’ data

Regarding elements, the following are included (these are stored in Bio4j, not S3):

  • mRNA
  • misc RNA
  • ncRNA
  • rRNA
  • tRNA
  • tmRNA
  • CDS
  • gene

Data stored for all these elements includes their positions and note attribute (whenever it’s found). I have to say that we decided not to extract more information from the gbff files since it can easily be accessed navigating through Bio4j by means of the connection Uniprot entry <–> RefSeq genome element. Plus, information included in Uniprot releases is much more reliable than that found in RefSeq files.

GO Annotation graph visualizations with Bio4j Go Tools + Gephi Toolkit + SiGMa project

Hello everyone ;)

We’re back from Easter holidays bringing some cool graph visualization stuff.

Bio4j Go Tools now includes a new feature providing you with an interactive graph visualization for protein GO annotations. The url of the app is still the same old one.

On the server side, we’re using Gephi Toolkit for applying layout algorithms while the corresponding Gexf file is generated with the class GephiExporter from BioinfoUtil project. The service is included in the project Bio4jTestServer, specifically the servlet GetGoAnnotationGexfServlet.

On the client side, we’re using the open-source project SiGMa for graph visualization.

Here you have a screenshot of a small sample of GO Annotation results: