1. Create a new Spark project
This is a step-by-step introduction intended for beginners.
If you are already familiar with the basics of Conveyor, our how-to guides are probably more appropriate.
In this tutorial you will create and deploy a new batch processing project using Scala and Spark, one of the most popular distributed data processing frameworks. The project will be scheduled to run daily and integrated with various cloud services such as AWS Glue and S3.
The principles covered by the tutorial will apply to any other development stack.
1.1 Set up your project
If you have not already done so, you will need to set up Conveyor locally; see set up the local development environment.
For convenience, let us define an environment variable for the rest of the tutorial. Use your first name; it should not contain any special characters such as underscores or dashes. john would be a good name.
Please open a terminal and execute the following:
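For example (john is just the placeholder suggested above; substitute your own first name):

```shell
# Define the project name used throughout this tutorial
export PROJECT_NAME=john
```

The variable is read by the conveyor project create command later in this tutorial.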
1.1.1 Create the project
Any batch job in any language can run on Conveyor, as long as nothing prevents it from being dockerized. For convenience, we provide ready-to-go templates for batch jobs using vanilla Python, Spark, and other languages and frameworks. Use of these templates is optional.
For this tutorial, we will use the Spark template:
conveyor project create --name $PROJECT_NAME --template spark
For this tutorial, select the following options for your project (leave the other options on their default settings):

- 2.12 (most recent supported version)
- 3.0 (most recent supported version)
- batch (unless you want to try out Spark streaming, in which case enter batch-and-streaming; see streaming)
- aws (if you are using Azure, update it accordingly)
It takes a few moments to create the project. The result is a local folder with the same name as the project; the project is also registered in Conveyor and will be visible in the UI. Let's have a look inside.
1.2 Explore the code
Have a look at the folder that was just created and identify the following subfolders.
ls -al | grep '^d'
This should show you the following directories (the Dockerfile is a regular file, so it will not appear in this filtered listing):
- .conveyor contains Conveyor-specific configuration.
- dags contains the Airflow DAGs that will be deployed as part of this project. Here you define when and how your project will run.
- src/main/scala contains the code that will be executed as part of your project.
- src/main/test contains the unit tests for the source code.
- Dockerfile defines how to package your project code, as well as the versions of every dependency (Spark, Python, AWS, ...). We supply our own Spark images that can run on both AWS and Azure; more details about the images can be found here.
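As a quick sanity check, you can verify this layout from inside the project folder. This is just a sketch that checks the paths listed above, not a Conveyor command:

```shell
# Check that the template produced the expected files and folders.
# Run from the project root; the paths are those described in this section.
for path in .conveyor dags src/main/scala src/main/test Dockerfile; do
  if [ -e "$path" ]; then
    echo "ok: $path"
  else
    echo "missing: $path"
  fi
done
```

Every line should print "ok"; a "missing" line means the project was not created correctly or you are in the wrong directory.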
1.3 Explore the UI
In the Conveyor UI, select the projects menu on the left and find your project in the list. Clicking on it will show you the details. Use this opportunity to update the description.