
Apache Beam Google Datastore Readfromdatastore Entity Protobuf

I am trying to use Apache Beam's Google Datastore API to ReadFromDatastore: p = beam.Pipeline(options=options) (p | 'Read from Datastore' >> ReadFromDatastore(gcloud_options.

Solution 1:

I was getting the same issue and the accepted answer didn't work for me.

The OP has 3 questions:

1. Is there something I can do to convert an entity_pb2.Entity to something usable?

You don't say exactly what difficulty you are having with the returned value, but every entity_pb2.Entity instance should have a properties attribute. You should be able to use it to get values out of your entity, e.g. property_value = entity.properties.get('<your_property_name>')


Update: I think I now know what the OP meant by "usable": even when you do property_value = entity.properties.get('<your_property_name>'), the value you get back in property_value is still in protocol buffer format. To get a plain dict of properties you can do this:

from googledatastore import helper

# Unwrap each protobuf value into a plain Python value, keyed by property name.
value_dict = {prop_name: helper.get_value(prop_value) for prop_name, prop_value in entity.properties.items()}
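To see why the unwrapping step is needed, here is a library-free sketch of the idea. FakeValue, unwrap and to_plain_dict are hypothetical stand-ins that I made up for illustration; they are not part of googledatastore. The assumption is that a protobuf value keeps its payload in exactly one typed field (string_value, integer_value and so on), and that something like helper.get_value has to pick out whichever field is set.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class FakeValue:
    """Hypothetical stand-in for a protobuf value: at most one field is set."""
    string_value: Optional[str] = None
    integer_value: Optional[int] = None
    double_value: Optional[float] = None
    boolean_value: Optional[bool] = None


def unwrap(value: FakeValue) -> Any:
    """Return the payload of whichever typed field is set, or None if none is."""
    for field_name in ("string_value", "integer_value", "double_value", "boolean_value"):
        payload = getattr(value, field_name)
        if payload is not None:
            return payload
    return None


def to_plain_dict(properties: Dict[str, FakeValue]) -> Dict[str, Any]:
    """Same shape as the value_dict expression, but on the stand-in types."""
    return {name: unwrap(value) for name, value in properties.items()}


properties = {
    "name": FakeValue(string_value="John"),
    "age": FakeValue(integer_value=18),
}
plain = to_plain_dict(properties)
print(plain)
```

With the real library you would keep using helper.get_value; the sketch only shows the shape of the conversion from wrapped values to a plain dict.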

2. Is the ReadFromDatastore just too new for real use right now?

I too initially thought the same but I seem to have it working now (see my answer to Q3 below).

3. Is there another approach I should be using?

You absolutely must not import the google-cloud-datastore library into your project. Doing so will cause the TypeError: Couldn't build proto file into descriptor pool! error that was in your original question to be raised when you import ReadFromDatastore from apache_beam.

From the investigation and debugging I've done, it seems that the current version of the apache-beam library (v2.8.0) is simply incompatible with the google-cloud-datastore library (v1.7.1). This means we must use the bundled googledatastore library (v7.0.1) to achieve what we want.

Further reading / reference(s):

https://cloud.google.com/blog/products/gcp/how-to-do-data-processing-and-analytics-from-google-app-engine-with-google-cloud-dataflow

https://github.com/amygdala/gae-dataflow

https://gcloud-python.readthedocs.io/en/0.10.0/_modules/gcloud/datastore/helpers.html

Solution 2:

An alternative (and easier) way to specify the query is the following:

import apache_beam as beam
from google.cloud import datastore
from google.cloud.datastore import query as datastore_query
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore

# pipeline_options, project and kind are assumed to be defined elsewhere.
p = beam.Pipeline(options=pipeline_options)
ds_client = datastore.Client(project=project)
query = ds_client.query(kind=kind)
# Optional filters take the form query.add_filter('column', 'operator', criteria), e.g.:
# query.add_filter('age', '>', 18)
# query.add_filter('name', '=', "John")
query = datastore_query._pb_from_query(query)

p | 'ReadFromDatastore' >> ReadFromDatastore(project=project, query=query)
p.run().wait_until_finish()

When submitting the job to the DataflowRunner (in the cloud), make sure your local requirements match the setup.py file you send to Google Cloud. In my experience, you must install Apache Beam 2.1.0 on your local machine and then specify the same version in your setup.py file for the job to work on the cloud workers.
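One way to catch a local/worker mismatch before submitting is to compare the version pinned in setup.py with the one installed locally. This is only a rough sketch: pinned_version and INSTALL_REQUIRES are hypothetical names, the requirement string is an assumed example of what an install_requires entry might look like, and local_version is hard-coded here where you would normally read apache_beam.__version__.

```python
def pinned_version(requirements, package):
    """Return the version pinned with '==' for `package`, or None if not pinned."""
    for requirement in requirements:
        name, separator, version = requirement.partition("==")
        # Drop extras such as "[gcp]" before comparing package names.
        base_name = name.split("[", 1)[0].strip().lower()
        if separator and base_name == package.lower():
            return version.strip()
    return None


# Hypothetical install_requires list, as it might appear in setup.py.
INSTALL_REQUIRES = ["apache-beam[gcp]==2.1.0"]

# Assumption: hard-coded for the sketch; locally this would be apache_beam.__version__.
local_version = "2.1.0"
worker_version = pinned_version(INSTALL_REQUIRES, "apache-beam")

if worker_version != local_version:
    print("Version mismatch: local %s, setup.py %s" % (local_version, worker_version))
else:
    print("Local and setup.py versions match: %s" % local_version)
```

The check only understands exact '==' pins, which is what this answer relies on; range specifiers such as '>=' are reported as not pinned.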

Solution 3:

You can use the function google.cloud.datastore.helpers.entity_from_protobuf to convert entity_pb2.Entity to google.cloud.datastore.entity.Entity.

google.cloud.datastore.entity.Entity is a subclass of dict and will give you the usability you need.

Solution 4:

The latest version of Apache Beam, 2.13, deprecates the old approach based on the googledatastore library and adds a new implementation that uses the newer and more human-friendly google-cloud-datastore library.

https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.datastore.v1new.datastoreio.html

https://github.com/apache/beam/pull/8262

There's still an open issue to add an example, so for now you'll have to figure that part out.

https://issues.apache.org/jira/browse/BEAM-7350
