Apache Beam Google Datastore Readfromdatastore Entity Protobuf
Solution 1:
I was getting the same issue and the accepted answer didn't work for me.
The OP has 3 questions:
1. Is there something I can do to convert an entity_pb2.Entity to something usable?
You don't specify exactly what difficulty you are having with the returned value, but all instances of entity_pb2.Entity have a properties attribute. You should be able to use that to get the values out of your entity, e.g. property_value = entity.properties.get('<your_property_name>')
Update: I think I now know what the OP meant by "usable": even when you do property_value = entity.properties.get('<your_property_name>'), the value you get in property_value is still in protocol buffer format. So to get a dict of plain Python values you can do this:
from googledatastore import helper
value_dict = {prop_name: helper.get_value(prop_pb) for prop_name, prop_pb in entity.properties.items()}
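For context, here is a minimal sketch (not from the original answer) of how that conversion might sit inside a Beam pipeline; the extract_properties name and the entities PCollection are hypothetical:

import apache_beam as beam
from googledatastore import helper

def extract_properties(entity):
    # entity.properties is a protobuf map; helper.get_value unwraps each
    # property's Value message into a plain Python value.
    return {name: helper.get_value(value_pb)
            for name, value_pb in entity.properties.items()}

# 'entities' is assumed to be the PCollection of entity_pb2.Entity values
# produced by ReadFromDatastore.
dicts = entities | 'EntityToDict' >> beam.Map(extract_properties)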
2. Is the ReadFromDatastore just too new for real use right now?
I too initially thought the same but I seem to have it working now (see my answer to Q3 below).
3. Is there another approach I should be using?
You absolutely must not import the google-cloud-datastore library into your project. Doing so will cause the TypeError: Couldn't build proto file into descriptor pool! error from your original question to be raised when you import ReadFromDatastore from apache_beam.
From the investigation/debugging I've been doing, it seems that the current version of the apache-beam library (v2.8.0) is simply incompatible with the google-cloud-datastore library (v1.7.1). This means we must use the bundled googledatastore library (v7.0.1) instead to achieve what we want.
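For illustration, here is a minimal sketch of building the query as a raw protobuf with the bundled libraries, avoiding any dependency on google-cloud-datastore; the project and kind names are hypothetical:

import apache_beam as beam
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from google.cloud.proto.datastore.v1 import query_pb2

# Build the query directly as a protobuf message.
query = query_pb2.Query()
query.kind.add().name = 'MyKind'  # hypothetical kind name

p = beam.Pipeline()
entities = p | 'ReadFromDatastore' >> ReadFromDatastore(project='my-project', query=query)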
Further reading / reference(s):
https://github.com/amygdala/gae-dataflow
https://gcloud-python.readthedocs.io/en/0.10.0/_modules/gcloud/datastore/helpers.html
Solution 2:
An alternative (and easier) way to specify the query is the following:
import apache_beam as beam
from google.cloud import datastore
from google.cloud.datastore import query as datastore_query
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore

p = beam.Pipeline(options=pipeline_options)
ds_client = datastore.Client(project=project)
query = ds_client.query(kind=kind)
# possible filter: query.add_filter('column', 'operator', criteria)
# query.add_filter('age', '>', 18)
# query.add_filter('name', '=', "John")
# _pb_from_query is a private helper that converts the client query
# into the protobuf form that ReadFromDatastore expects.
query = datastore_query._pb_from_query(query)
p | 'ReadFromDatastore' >> ReadFromDatastore(project=project, query=query)
p.run().wait_until_finish()
When submitting the job to the DataflowRunner (in the cloud), make sure your local requirements are in line with the setup.py file you are sending to Google Cloud. In my experience, you must install apache-beam 2.1.0 on your local machine and specify the same version in your setup.py file for it to work on the cloud workers.
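For illustration, a minimal setup.py along those lines might look like the following; the package name and the exact version pin are assumptions you should match to your own environment:

# Sketch of a setup.py pinning the same apache-beam version as the
# local machine. The package name 'my-dataflow-job' is hypothetical.
import setuptools

setuptools.setup(
    name='my-dataflow-job',
    version='0.0.1',
    install_requires=[
        'apache-beam[gcp]==2.1.0',  # must match the locally installed version
    ],
    packages=setuptools.find_packages(),
)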
Solution 3:
You can use the function google.cloud.datastore.helpers.entity_from_protobuf to convert entity_pb2.Entity to google.cloud.datastore.entity.Entity. google.cloud.datastore.entity.Entity is a subclass of dict and will give you the usability you need.
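For example, a minimal sketch of applying that conversion to the output of ReadFromDatastore (the pipeline wiring here is assumed, not from the original answer):

import apache_beam as beam
from google.cloud.datastore.helpers import entity_from_protobuf

# 'entities' is assumed to be the PCollection of entity_pb2.Entity values
# produced by ReadFromDatastore.
client_entities = entities | 'ToClientEntity' >> beam.Map(entity_from_protobuf)
# Each element is now dict-like, so properties can be read directly:
names = client_entities | 'GetName' >> beam.Map(lambda e: e.get('name'))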
Solution 4:
The latest version of Apache Beam (2.13) deprecates the old approach based on the googledatastore library and adds a new implementation that uses the newer and more human-friendly google-cloud-datastore library.
https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.datastore.v1new.datastoreio.html
https://github.com/apache/beam/pull/8262
There's still an open issue to add an example, so for now you'll have to figure that part out.
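In the meantime, here is a minimal sketch based on the v1new pydoc linked above; the Query arguments, the names, and the to_client_entity() step are assumptions drawn from that documentation rather than a tested example:

import apache_beam as beam
from apache_beam.io.gcp.datastore.v1new.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1new.types import Query

# The v1new API takes a typed Query object instead of a raw protobuf.
query = Query(kind='MyKind', project='my-project')  # hypothetical names

with beam.Pipeline() as p:
    entities = p | 'ReadFromDatastore' >> ReadFromDatastore(query)
    # Elements are v1new Entity wrappers; to_client_entity() converts each
    # one into a dict-like google.cloud.datastore entity.
    dicts = entities | 'ToClientEntity' >> beam.Map(lambda e: e.to_client_entity())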