gigl.common.services.DataprocService#

class gigl.common.services.dataproc.DataprocService(project_id: str, region: str)#

Bases: object

A service class that provides methods to interact with Google Cloud Dataproc.

Args:

project_id (str): The ID of the Google Cloud project.
region (str): The region where the Dataproc cluster is located.
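For example, a service instance can be constructed as in the sketch below; the project ID and region values are placeholders:

```python
from gigl.common.services.dataproc import DataprocService

# Placeholder project and region values for illustration.
dataproc_service = DataprocService(
    project_id="my-gcp-project",
    region="us-central1",
)
```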

Methods

__init__

create_cluster

Creates a Dataproc cluster.

delete_cluster

Deletes a cluster with the given name.

does_cluster_exist

Checks if a cluster with the given name exists.

get_running_job_ids_on_cluster

Retrieves the running job IDs on the specified cluster.

get_submitted_job_ids

Retrieves the job IDs of all active jobs submitted to a specific cluster.

submit_and_wait_scala_spark_job

Submits a Scala Spark job to a Dataproc cluster and waits for its completion.

__init__(project_id: str, region: str) → None#

create_cluster(cluster_spec: dict) → None#

Creates a Dataproc cluster.

Args:
cluster_spec (dict): A dictionary containing the cluster specification.

For more details, refer to the documentation at: https://cloud.google.com/python/docs/reference/dataproc/latest/google.cloud.dataproc_v1.types.Cluster

Returns:

None
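A minimal sketch of creating a cluster; the spec fields follow the google.cloud.dataproc_v1.types.Cluster schema linked above, and the project, cluster name, and machine types are placeholders:

```python
# Minimal cluster specification; all values are illustrative.
cluster_spec = {
    "project_id": "my-gcp-project",
    "cluster_name": "my-dataproc-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

dataproc_service.create_cluster(cluster_spec=cluster_spec)
```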

delete_cluster(cluster_name: str) → None#

Deletes a cluster with the given name.

Args:

cluster_name (str): The name of the cluster to delete.

Returns:

None

does_cluster_exist(cluster_name: str) → bool#

Checks if a cluster with the given name exists.

Args:

cluster_name (str): The name of the cluster to check.

Returns:

bool: True if the cluster exists, False otherwise.
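does_cluster_exist pairs naturally with delete_cluster for idempotent teardown, as in this sketch (the cluster name is a placeholder):

```python
# Delete the cluster only if it actually exists, so teardown is
# safe to re-run; the cluster name is a placeholder.
cluster_name = "my-dataproc-cluster"
if dataproc_service.does_cluster_exist(cluster_name=cluster_name):
    dataproc_service.delete_cluster(cluster_name=cluster_name)
```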

get_running_job_ids_on_cluster(cluster_name: str) → List[str]#

Retrieves the running job IDs on the specified cluster.

Args:

cluster_name (str): The name of the cluster.

Returns:

List[str]: The running job IDs on the cluster.

get_submitted_job_ids(cluster_name: str) → List[str]#

Retrieves the job IDs of all active jobs submitted to a specific cluster.

Args:

cluster_name (str): The name of the cluster.

Returns:

List[str]: The job IDs of all active jobs submitted to the cluster.
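The two job-listing methods can be used together to inspect cluster activity before submitting new work; a sketch with a placeholder cluster name:

```python
# Compare currently running jobs against all active submitted jobs;
# the cluster name is a placeholder.
cluster_name = "my-dataproc-cluster"
running_ids = dataproc_service.get_running_job_ids_on_cluster(cluster_name=cluster_name)
active_ids = dataproc_service.get_submitted_job_ids(cluster_name=cluster_name)
print(f"Running jobs: {running_ids}")
print(f"Active submitted jobs: {active_ids}")
```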

submit_and_wait_scala_spark_job(cluster_name: str, max_job_duration: timedelta, main_jar_file_uri: Uri, runtime_args: List[str] | None = [], extra_jar_file_uris: List[str] | None = [], properties: dict | None = {}, fail_if_job_already_running_on_cluster: bool | None = True) → None#

Submits a Scala Spark job to a Dataproc cluster and waits for its completion.

Args:

cluster_name (str): The name of the Dataproc cluster.
max_job_duration (datetime.timedelta): The maximum duration allowed for the job to run.
main_jar_file_uri (Uri): The URI of the main jar file for the Spark job.
runtime_args (Optional[List[str]]): Additional runtime arguments for the Spark job. Defaults to [].
extra_jar_file_uris (Optional[List[str]]): Additional jar file URIs for the Spark job. Defaults to [].
properties (Optional[dict]): Properties to set on the Spark job. Defaults to {}.
fail_if_job_already_running_on_cluster (Optional[bool]): Whether to fail if there are already running jobs on the cluster. Defaults to True.

Returns:

None
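A sketch of a typical submission, assuming the Uri argument can be built with gigl.common.UriFactory.create_uri (check your version for the exact helper); the jar URIs, arguments, and cluster name are placeholders:

```python
from datetime import timedelta

from gigl.common import UriFactory  # assumed helper for constructing Uri objects

# Submit a Scala Spark job and block until it finishes or exceeds the
# two-hour limit; all URIs, args, and names below are placeholders.
dataproc_service.submit_and_wait_scala_spark_job(
    cluster_name="my-dataproc-cluster",
    max_job_duration=timedelta(hours=2),
    main_jar_file_uri=UriFactory.create_uri("gs://my-bucket/jars/my-job.jar"),
    runtime_args=["--input", "gs://my-bucket/input"],
    extra_jar_file_uris=["gs://my-bucket/jars/dependency.jar"],
    properties={"spark.executor.memory": "4g"},
)
```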