You need to design your Airflow DAG for data quality checks to be scalable and manageable as the number of datasets and checks grows. How can you achieve this?
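One common answer is to generate tasks dynamically from a configuration list, so new datasets or checks are a config change rather than new DAG code. A minimal sketch of that pattern follows; the dataset names, check names, and `dq_` task-ID convention are illustrative assumptions, not part of the question.

```python
# Sketch: generate one Airflow task per (dataset, check) pair from a
# config list, so the DAG scales as datasets and checks grow.
# Dataset and check names below are hypothetical examples.

DATASETS = ["orders", "customers"]
CHECKS = ["row_count", "null_scan"]

def build_check_task_ids(datasets, checks):
    """Return the task IDs a dynamically generated DAG would contain."""
    return [f"dq_{d}_{c}" for d in datasets for c in checks]

# In a real DAG file, each ID would back an operator, e.g.:
#
#   for task_id in build_check_task_ids(DATASETS, CHECKS):
#       PythonOperator(task_id=task_id, python_callable=run_check, dag=dag)
```

Adding a third dataset then means appending one string to `DATASETS`, and every check is created for it automatically.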
You want to write the results of a Spark DataFrame operation to a Parquet file for efficient storage and retrieval. How can you achieve this efficiently?
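A typical approach uses the DataFrameWriter API. A hedged sketch, assuming `df` is a `pyspark.sql.DataFrame`; the function name, the `event_date` partition column, and the partition count are illustrative choices, not fixed API requirements:

```python
def write_events_parquet(df, output_path, partitions=8):
    """Write a Spark DataFrame to Parquet.

    Sketch assuming `df` is a pyspark.sql.DataFrame; `event_date` and
    the partition count are illustrative, tune them for your data.
    """
    (df.repartition(partitions)        # control the number of output files
       .write
       .mode("overwrite")              # replace any prior run's output
       .partitionBy("event_date")      # enable partition pruning on read
       .parquet(output_path))          # columnar, compressed storage
```

Parquet's columnar layout plus `partitionBy` lets later queries skip both unneeded columns and unneeded partitions.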
Discuss the trade-offs between using wide tables (many columns) and narrow tables (few columns) in Spark and the implications for data processing efficiency.
Which operator or feature in Apache Airflow can be used to dynamically adjust the schedule of data quality checks based on the volume of incoming data?
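One pattern behind this question is a lightweight "controller" DAG (or a sensor) that inspects arrival volume and only then triggers the check DAG, e.g. via `TriggerDagRunOperator`. A sketch of just the decision logic; the threshold value is an assumed, illustrative number:

```python
# Decision logic a controller DAG could evaluate before firing the
# data-quality DAG. The threshold is illustrative; tune per pipeline.

HIGH_VOLUME_THRESHOLD = 1_000_000  # rows

def should_trigger_checks(rows_arrived_today: int) -> bool:
    """Trigger the quality-check DAG only once enough data has landed."""
    return rows_arrived_today >= HIGH_VOLUME_THRESHOLD
```

In Airflow this predicate would typically sit behind a `ShortCircuitOperator` or a sensor, gating the downstream trigger.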
You need to optimize the performance of a Spark query that involves joining data from multiple Hive tables. What strategies can you employ to improve efficiency?
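One standard strategy is broadcasting the small dimension table so the join avoids shuffling the large fact table. The `/*+ BROADCAST(...) */` hint is standard Spark SQL; the table and column names below are hypothetical:

```python
# Build a Spark SQL join that broadcasts the small dimension table.
# Table/column names are illustrative examples.

def broadcast_join_sql(fact_table: str, dim_table: str) -> str:
    return f"""
        SELECT /*+ BROADCAST(d) */ f.*, d.region
        FROM {fact_table} f
        JOIN {dim_table} d
          ON f.customer_id = d.customer_id
    """

# Executed via, e.g.:
#   spark.sql(broadcast_join_sql("sales_fact", "customer_dim"))
```

Other levers worth naming in an answer: partition pruning on the Hive tables, pre-filtering before the join, and tuning shuffle partitions.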
You want to perform an Iceberg table join in CDP using Spark SQL, but you notice it's much slower than expected. What could be some of the reasons? (Choose two)
You are deploying a Spark application on Kubernetes, which requires access to a web UI. You decide to expose it using a Kubernetes service. Which type of service would you typically use to expose the Spark UI to external traffic?
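For external traffic, the typical choice is a `LoadBalancer` service (`NodePort` is the alternative when no cloud load balancer is available). A sketch of such a manifest; the names and labels are illustrative, while port 4040 is the Spark UI's default:

```yaml
# Sketch: expose the Spark driver's web UI (default port 4040)
# outside the cluster. Names and selector labels are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: spark-ui
spec:
  type: LoadBalancer      # provisions an external load balancer
  selector:
    app: spark-driver     # must match the driver pod's labels
  ports:
    - port: 80            # external port
      targetPort: 4040    # Spark UI port on the driver pod
```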
Describe how you would implement a data lineage solution within the Cloudera Data Engineering service to track the origin and flow of data throughout your data pipelines.
You're developing a Spark application with multiple stages, and you want to ensure that later stages only start processing after all data from the previous stage is complete. How can you achieve this dependency management in Spark?
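Spark's shuffle boundaries already enforce stage ordering, but to make a dependency explicit you can force full materialization with a cache plus an action. A duck-typed sketch, assuming `df` is a `pyspark.sql.DataFrame`; the helper name is an illustrative assumption:

```python
def materialize(df):
    """Force full computation of `df` before downstream logic runs.

    Sketch assuming `df` is a pyspark.sql.DataFrame: persisting plus an
    action (count) computes every partition, so later stages read the
    completed result instead of re-running the lineage.
    """
    df.persist()   # keep the computed partitions in memory/disk
    df.count()     # action: triggers execution of all prior stages
    return df
```

Downstream transformations on the returned DataFrame then start from the cached, fully computed data.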
What does setting the Spark configuration parameter 'spark.sql.shuffle.partitions' impact?
A. The default level of parallelism for joins and aggregations