Databricks-Certified-Professional-Data-Engineer Free Exam Questions (Databricks Certified Professional Data Engineer Certification)
In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both deep and shallow clone, development tables are created using shallow clone.
A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that vacuum was run the day before.
Why are the cloned tables no longer working?
Correct answer: A
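For context, a minimal sketch of why this happens (table names are hypothetical): a shallow clone copies only the Delta transaction log and keeps pointing at the source table's data files, so a VACUUM on the source can delete files the clone still references.

Code sketch:
# Hypothetical tables; a sketch of how a shallow clone breaks after VACUUM.
spark.sql("CREATE TABLE dev.customers SHALLOW CLONE prod.customers")

# The clone has its own transaction log but no copies of the data files:
# reads resolve to the Parquet files under prod.customers.
spark.table("dev.customers").count()

# Type 1 SCD merges rewrite files in prod.customers; the clone's log still
# points at the superseded files. Once VACUUM removes them, reads against
# the clone fail with file-not-found errors.
spark.sql("VACUUM prod.customers RETAIN 168 HOURS")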
A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.
Which consideration will impact the decisions made by the engineer while migrating this workload?
Correct answer: A
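A sketch of the relevant behavior (schema and names are hypothetical; primary and foreign key constraints require Unity Catalog): in Databricks these constraints are informational only and are not enforced on write, so the source system's write-time validation has to be reimplemented in the pipeline.

Code sketch:
spark.sql("""
    CREATE TABLE dim_customer (
      customer_id BIGINT NOT NULL,
      name STRING,
      CONSTRAINT pk_customer PRIMARY KEY (customer_id)
    )
""")
spark.sql("""
    CREATE TABLE fact_sales (
      sale_id BIGINT,
      customer_id BIGINT,
      CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES dim_customer
    )
""")

# This insert succeeds even though customer_id 999 has no match in
# dim_customer: referential integrity must be validated explicitly
# (e.g. with a join or MERGE) rather than relying on write-time checks.
spark.sql("INSERT INTO fact_sales VALUES (1, 999)")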
A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks.
One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.
What approach would allow them to do this?
Correct answer: A
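One widely used pattern, sketched below with hypothetical table and rule names: keep the shared expectations in a single Python dict (or load them from a central location) and apply them to each table with @dlt.expect_all.

Code sketch:
import dlt

shared_rules = {
    "valid_id": "id IS NOT NULL",
    "valid_timestamp": "event_time IS NOT NULL",
}

@dlt.table
@dlt.expect_all(shared_rules)   # same rules reused on every table
def orders_clean():
    return dlt.read_stream("orders_raw")

@dlt.table
@dlt.expect_all(shared_rules)
def customers_clean():
    return dlt.read_stream("customers_raw")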
A data ingestion task requires a 1 TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize and Auto-Compaction cannot be used.
Which strategy will yield the best performance without shuffling data?
Correct answer: E
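The shuffle-free lever here is the input split size. A sketch, assuming splittable line-delimited JSON and hypothetical paths: spark.sql.files.maxPartitionBytes caps how much data each task reads, and with only narrow transformations each task then writes one part file of roughly that size.

Code sketch:
# Cap each input partition at ~512 MB; no repartition/coalesce (and thus
# no shuffle) is needed to hit the target part-file size.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/data/raw/events/")          # hypothetical 1 TB source
df.write.mode("overwrite").parquet("/data/curated/events/")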
A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each micro-batch of data is processed in less than 3 seconds. During peak hours, execution time for each micro-batch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?
Correct answer: E
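One adjustment consistent with the requirement, sketched with hypothetical paths: shorten the processing-time trigger so less data accumulates per micro-batch during peak hours, keeping each batch within the latency budget.

Code sketch:
query = (df.writeStream
           .format("delta")
           .option("checkpointLocation", "/checkpoints/events")  # hypothetical
           .trigger(processingTime="5 seconds")                  # was 10 seconds
           .start("/tables/events"))                             # hypothetical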
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
[code block image not reproduced in the source]
Choose the response that correctly fills in the blank within the code block to complete this task.
Correct answer: E
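Since the original code block is not reproduced, the sketch below reconstructs the whole pattern the blank is testing; the watermark value and aliases are assumptions. A non-overlapping five-minute interval is a tumbling window, i.e. window("event_time", "5 minutes") inside groupBy.

Code sketch:
from pyspark.sql.functions import avg, window

result = (
    df.withWatermark("event_time", "10 minutes")   # assumed watermark
      .groupBy(window("event_time", "5 minutes"), "device_id")
      .agg(
          avg("temp").alias("avg_temp"),
          avg("humidity").alias("avg_humidity"),
      )
)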
A DLT pipeline includes the following streaming tables:
raw_iot ingests raw device measurement data from a heart rate tracking device.
bpm_stats incrementally computes user statistics based on BPM measurements from raw_iot.
How can the data engineer configure this pipeline to retain manually deleted or updated records in the raw_iot table while recomputing the downstream table when a pipeline update is run?
Correct answer: D
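A sketch of the configuration commonly used for this scenario (Python DLT API; source path and column names are hypothetical): setting the table property pipelines.reset.allowed to false on raw_iot keeps a pipeline update or full refresh from clearing its manually corrected records, while bpm_stats is still recomputed from whatever raw_iot contains.

Code sketch:
import dlt
from pyspark.sql.functions import avg

@dlt.table(
    name="raw_iot",
    table_properties={"pipelines.reset.allowed": "false"},  # survives refreshes
)
def raw_iot():
    return (spark.readStream.format("cloudFiles")
                 .option("cloudFiles.format", "json")
                 .load("/data/iot/"))                       # hypothetical path

@dlt.table(name="bpm_stats")
def bpm_stats():
    return (dlt.read("raw_iot")                             # recomputed each update
               .groupBy("device_id")                        # hypothetical column
               .agg(avg("bpm").alias("avg_bpm")))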