Transcribing Large Datasets Into Neo4j Via Python (Py2neo)
Solution 1:
Creating a lot of elements individually will always be slow, largely because of the number of network round trips required. Carrying out a match between each creation adds further latency on top of that.
The best approach to this kind of problem is to look at batching, both for reads and for writes. While you won't be able to do everything in a single operation, batching your operations into at least a few hundred at a time will have a significant effect. In your case, you will probably need to alternate: a bulk read, followed by a bulk write, and so on.
Specifically, look to match against multiple entities in one query (you may be able to use an IN predicate for this, or you may need to drop into raw Cypher). For writes, build up a Subgraph locally with the relevant nodes and relationships, then create it in a single call, as in the sketch below.
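Here is a minimal sketch of both halves. The connection details, the Person label, and the property and relationship names are all placeholder assumptions for illustration, not from your data model:

```python
from py2neo import Graph, Node, Relationship, Subgraph

# Connection details are assumed; substitute your own.
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Bulk read: fetch many existing nodes in one round trip with an IN predicate.
names = ["alice", "bob", "carol"]
cursor = graph.run(
    "MATCH (p:Person) WHERE p.name IN $names RETURN p.name AS name, p",
    names=names,
)
existing = {record["name"]: record["p"] for record in cursor}

# Bulk write: assemble nodes and relationships locally, then create the
# whole Subgraph in a single call instead of one call per element.
nodes, rels = [], []
for name in ["dave", "erin"]:
    node = Node("Person", name=name)
    nodes.append(node)
    if "alice" in existing:
        rels.append(Relationship(existing["alice"], "KNOWS", node))

graph.create(Subgraph(nodes, rels))  # one network call for the whole batch
```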
Your optimal batch size will only be discovered through experimentation, so you likely won't get it right the first time; a simple chunking loop like the one below makes it easy to tune. But batching is definitely the key here.
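A minimal chunking sketch, again with an assumed connection, label, and stand-in dataset; the 500-row batch size is just a starting point to tune:

```python
from py2neo import Graph, Node, Subgraph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))  # assumed URI

def chunks(items, size):
    """Yield successive slices of `items` containing at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

rows = [{"name": f"user{i}"} for i in range(10_000)]  # stand-in dataset

batch_size = 500  # tune experimentally; a few hundred is a reasonable start
for batch in chunks(rows, batch_size):
    nodes = [Node("Person", **row) for row in batch]
    graph.create(Subgraph(nodes))  # one round trip per batch, not per node
```

If your py2neo version is recent enough, the py2neo.bulk module also ships helpers along these lines, which may save you building the Subgraph yourself.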