1. Create a new Spark project
This is a step-by-step introduction intended for beginners.
If you are already familiar with the basics of Conveyor, our how-to guides are probably more appropriate.
In this tutorial you will create and deploy a new batch processing project using Scala and Spark, the most popular distributed data processing framework. The project will be scheduled to run daily and integrated with various cloud services such as AWS Glue and S3.
The principles covered by the tutorial will apply to any other development stack.
1.1 Set up your project
If you have not already done so, you will need to set up Conveyor locally: see set up the local development environment.
For convenience, let us define some environment variables for the rest of the tutorial.
Use your first name; it should not contain any special characters such as underscores or dashes. For example, john would be a good name.
Open your terminal and execute the following commands:
export NAME=INSERT_YOUR_NAME
export PROJECT_NAME=$NAME
export ENVIRONMENT_NAME=$NAME
1.1.1 Create the project
Any batch job in any language can run on Conveyor, as long as there is nothing that prevents it from being dockerized. For convenience, we provide ready-to-go templates for batch jobs using vanilla Python, Spark, and other languages and frameworks. Using these templates is optional.
For this tutorial, we will use the Spark template:
conveyor project create --name $PROJECT_NAME --template spark
For this tutorial, select the following options for your project (other options should be left on their default settings):
- scala_version: 2.12 (the most recent supported version)
- spark_version: 3.0 (the most recent supported version)
- conveyor_managed_role: No
- project_type: batch (unless you want to test out Spark streaming, in which case enter batch-and-streaming; see streaming)
- cloud: aws (if you are using Azure, you can update it accordingly)
It takes a few moments to create the project. The result is a local folder with the same name as the project; the project is also registered in Conveyor and will be visible in the UI. Let's have a look inside.
1.2 Explore the code
Have a look at the folder that was just created and identify the following subfolders.
cd $PROJECT_NAME
ls -al | grep '^d'
This should show you the following directories:
- .conveyor contains Conveyor-specific configuration.
- dags contains the Airflow DAGs that will be deployed as part of this project. Here you define when and how your project will run.
- src/main/scala contains the code that will be executed as part of your project (a minimal sketch follows below).
- src/main/test contains the unit tests for the source code (also sketched below).
- Dockerfile defines how to package your project code as well as the versions of every dependency (Spark, Python, AWS, ...). We supply our own Spark images that can run on both AWS and Azure; more details about the images can be found here.
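To make the layout more concrete, here is a minimal sketch of the kind of Spark batch job that lives under src/main/scala. The object name, column names and S3 paths are illustrative assumptions, not what the template actually generates.

import org.apache.spark.sql.{DataFrame, SparkSession}

object SampleJob {

  // Keep the transformation in its own function so it can be unit tested
  // without reading from or writing to S3.
  def cleanEvents(events: DataFrame): DataFrame =
    events.filter("event_type IS NOT NULL")

  def main(args: Array[String]): Unit = {
    // On the cluster, Conveyor and Spark supply most of the configuration,
    // so an empty builder with getOrCreate() is usually enough.
    val spark = SparkSession.builder()
      .appName("sample-job")
      .getOrCreate()

    // Read raw data, clean it and write the result back out.
    // Replace the paths with your own S3 locations.
    val raw = spark.read.parquet("s3://my-bucket/raw/events/")
    cleanEvents(raw).write.mode("overwrite").parquet("s3://my-bucket/clean/events/")

    spark.stop()
  }
}

A matching test for src/main/test could look like the sketch below, assuming ScalaTest 3.1+ is available on the test classpath (check the generated build definition). It exercises the transformation on a tiny in-memory DataFrame using a local SparkSession.

import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class SampleJobTest extends AnyFunSuite {

  // A local SparkSession is enough for unit tests; no cluster is involved.
  private val spark = SparkSession.builder()
    .master("local[*]")
    .appName("sample-job-test")
    .getOrCreate()

  import spark.implicits._

  test("cleanEvents drops rows without an event_type") {
    val input = Seq("click", null).toDF("event_type")
    assert(SampleJob.cleanEvents(input).count() == 1)
  }
}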
1.3 Explore the UI
In the Conveyor UI, you can find your project under the Projects menu on the left. Open that menu and find your project in the list; clicking on it will show you the details. Use this opportunity to update the project description.