AppEngine Dataflow SpringBoot
In a past blog post I wrote about scheduling Dataflow pipelines. There I described how to leverage Google's App Engine and its cron service to schedule Dataflow pipelines.
For simplicity I had used Spark Java as the web server. Now, though, I have decided that Spring Boot is more appropriate, as I would like to add more features to the application.
App Engine works very well with Spring Boot, and you can find examples at java examples.
The problem is with Dataflow and Spring Boot. If you do not specify which jars to upload (using the filesToStage flag), the Dataflow runner works them out by itself from the class path. With Spark Java this was very simple, since the uber-jar was created with the standard Maven plugin for a jar with dependencies. The Spring Maven plugin, however, creates a jar with a different structure: instead of merging the classes from all jars into one jar, it creates a container jar and adds all the other jars inside it. Because of this special layout, Dataflow cannot find the classes it needs and cannot deploy the job.
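To make the class path issue concrete, here is a minimal sketch in plain Java that lists what a class-path-based stager would see. It is not Dataflow's actual code; the class name ClasspathStaging and the method entriesToStage are hypothetical names used only for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Dataflow: it only illustrates the idea of
// deriving the files to upload from the JVM class path.
public class ClasspathStaging {

    // Splits a class path string into its individual entries (jars or directories).
    static List<String> entriesToStage(String classPath) {
        List<String> entries = new ArrayList<>();
        for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Print every class path entry of the running JVM.
        for (String entry : entriesToStage(System.getProperty("java.class.path"))) {
            System.out.println(entry);
        }
    }
}
```

When an application is started with java -jar, the class path contains only that one jar. For a flat uber-jar that single entry holds every class; for the Spring Boot container jar it holds other jars, which a stager working from the class path does not unpack.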
I had almost given up when I found that the Spring team themselves had recognized that their plugin could cause problems. When you inherit from the Spring parent pom, you usually add the spring-boot-maven-plugin.
But this plugin is the one causing the trouble. After going over the Spring parent pom, I found that it supports both the Spring plugin and the maven-shade-plugin.
So in your pom you need to add the Spring parent pom and then the shade plugin, which will pack your jar in a similar way to the assembly plugin with dependencies.
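As a rough sketch of that setup, the pom fragment below inherits the Spring parent pom and declares the shade plugin instead of spring-boot-maven-plugin. The version number and the main class com.example.Application are placeholders I have assumed for illustration. The fragment also assumes that the parent's plugin management already binds the shade goal to the package phase and reads the main class from the start-class property, so check this against the parent pom version you actually use.

```xml
<!-- Assumed version: replace with the Spring Boot version your project targets -->
<parent>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-parent</artifactId>
  <version>1.5.9.RELEASE</version>
</parent>

<properties>
  <!-- Hypothetical main class, used for the manifest of the shaded jar -->
  <start-class>com.example.Application</start-class>
</properties>

<build>
  <plugins>
    <!-- Declared instead of spring-boot-maven-plugin; configuration is
         expected to come from the parent pom's pluginManagement section -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
    </plugin>
  </plugins>
</build>
```

Running mvn package with this setup should produce a single flat jar, which is the layout the Dataflow runner can work with.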
To see the code for this, I have created a branch in my git repository for the scheduling of pipelines: https://github.com/chaimt/schedule-dataflow-appengine/tree/spring_boot