Professional-Data-Engineer practice questions to help you pass the exam with ease, updated March 23, 2025 [Q187-Q211]

Free Google Professional-Data-Engineer exam questions, the latest for 2025 and confirmed to appear on the actual exam

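To take the Google Professional-Data-Engineer exam, candidates need a deep understanding of data processing technologies such as Hadoop, Spark, and other big data frameworks. They should also be proficient in a programming language such as Python, Java, or Go, and have experience designing and developing data processing pipelines. In addition, candidates need hands-on experience with Google Cloud Platform services such as BigQuery, Dataflow, and Dataproc. Passing the Google Professional-Data-Engineer exam can be a valuable asset for data professionals who want to advance their careers or demonstrate their ability to manage data solutions on Google Cloud.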

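Google Professional-Data-Engineer certification exam topics:

Topic	Scope
Topic 1	Ingesting and processing data: This topic covers planning data pipelines, building pipelines, acquiring and importing data, and deploying and operationalizing pipelines.
Topic 2	Preparing and using data for analysis: You may see questions about preparing data for visualization, data sharing, and data assessment.
Topic 3	Storing data: This topic covers how to select storage systems and how to plan the use of a data warehouse. It also covers how to design for a data mesh.
Topic 4	Designing data processing systems: This topic covers in detail designing for security and compliance, reliability and fidelity, flexibility and portability, and data migration.
Topic 5	Maintaining and automating data workloads: This topic covers optimizing resources, designing for automation and repeatability, and organizing workloads according to business requirements. Finally, it covers monitoring and troubleshooting processes and maintaining awareness of failures.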

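This exam tests a candidate's proficiency in various aspects of data engineering, for example designing data processing systems, building and operating systems, optimizing and securing data networks, and developing high-quality machine learning models. The exam format includes multiple-choice questions, case studies, and practical scenarios that test hands-on knowledge of Google Cloud technologies.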

Question # 187
As your organization expands its usage of GCP, many teams have started to create their own projects.
Projects are further multiplied to accommodate different stages of deployments and target audiences. Each project requires unique access control configurations. The central IT team needs to have access to all projects.
Furthermore, data from Cloud Storage buckets and BigQuery datasets must be shared for use in other projects in an ad hoc way. You want to simplify access control management by minimizing the number of policies.
Which two steps should you take? (Choose two.)

A. For each Cloud Storage bucket or BigQuery dataset, decide which projects need access. Find all the active members who have access to these projects, and create a Cloud IAM policy to grant access to all these users.
B. Create distinct groups for various teams, and specify groups in Cloud IAM policies.
C. Use Cloud Deployment Manager to automate access provision.
D. Introduce resource hierarchy to leverage access control policy inheritance.
E. Only use service accounts when sharing data for Cloud Storage buckets and BigQuery datasets.

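Correct answer: B, D

Explanation:
Specifying groups in Cloud IAM policies (B) and using a resource hierarchy with policy inheritance (D) both reduce the number of policies to maintain. Cloud Deployment Manager (C) automates provisioning but does not reduce the number of policies.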

Question # 188
Your company is using wildcard tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:
# Syntax error : Expected end of statement but got "-" at [4:11]
SELECT age
FROM bigquery-public-data.noaa_gsod.gsod
WHERE age != 99
AND _TABLE_SUFFIX = '1929'
ORDER BY age DESC
Which table name will make the SQL statement work correctly?

A. `bigquery-public-data.noaa_gsod.gsod`
B. bigquery-public-data.noaa_gsod.gsod*
C. `bigquery-public-data.noaa_gsod.gsod'*
D. `bigquery-public-data.noaa_gsod.gsod*`

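Correct answer: D

Explanation:
A wildcard table name contains the special character (*), so the whole table path, including the wildcard, must be enclosed in backticks.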
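As a minimal sketch of the backtick-quoted wildcard form (option D), the query can be held as a Python string. The helpers `needs_backticks` and `quote_table` are hypothetical and not part of any BigQuery client library. They encode a conservative rule for illustration: quote the full path whenever it contains a wildcard or a hyphen.

```python
def needs_backticks(table_path: str) -> bool:
    """Return True when a table path should be backtick-quoted.

    Illustrative rule only: a wildcard always needs quoting, and
    quoting hyphenated paths as well is the conservative choice.
    """
    return "-" in table_path or "*" in table_path


def quote_table(table_path: str) -> str:
    """Wrap the whole path, wildcard included, in backticks when required."""
    return f"`{table_path}`" if needs_backticks(table_path) else table_path


table = quote_table("bigquery-public-data.noaa_gsod.gsod*")

query = f"""
SELECT age
FROM {table}
WHERE age != 99
  AND _TABLE_SUFFIX = '1929'
ORDER BY age DESC
"""

print(query)
```

Note that the closing backtick comes after the `*`, which is what separates option D from the other choices.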

Question # 189
You have an Oracle database deployed in a VM as part of a Virtual Private Cloud (VPC) network. You want to replicate and continuously synchronize 50 tables to BigQuery. You want to minimize the need to manage infrastructure. What should you do?

A. Create a Datastream service from Oracle to BigQuery, use a private connectivity configuration to the same VPC network, and a connection profile to BigQuery.
B. Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle Change Data Capture (CDC), and Dataflow to stream the Kafka topic to BigQuery.
C. Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle change data capture (CDC), and the Kafka Connect Google BigQuery Sink Connector.
D. Create a Pub/Sub subscription to write to BigQuery directly. Deploy the Debezium Oracle connector to capture changes in the Oracle database, and sink to the Pub/Sub topic.

Correct answer: A

Explanation:
Datastream is a serverless, scalable, and reliable service that enables you to stream data changes from sources such as Oracle and MySQL databases to Google Cloud destinations such as BigQuery and Cloud Storage. Datastream captures and streams database changes using change data capture (CDC) technology. Datastream supports private connectivity to the source system using VPC networks. Datastream also uses a connection profile for BigQuery, which simplifies the configuration and management of the data replication. References:
* Datastream overview
* Creating a Datastream stream
* Using Datastream with BigQuery
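
To make the change data capture idea concrete, here is a toy sketch that replays a stream of change events onto a replica table. The event format (`op`, `key`, `row`) is an assumption made for illustration. It is not Datastream's actual event schema, and no Google Cloud API is called.

```python
from typing import Any

# Hypothetical change events, in commit order.
events = [
    {"op": "insert", "key": 1, "row": {"name": "bolt", "qty": 10}},
    {"op": "insert", "key": 2, "row": {"name": "nut", "qty": 5}},
    {"op": "update", "key": 1, "row": {"name": "bolt", "qty": 7}},
    {"op": "delete", "key": 2, "row": None},
]


def apply_changes(
    replica: dict[int, dict[str, Any]], changes: list[dict[str, Any]]
) -> dict[int, dict[str, Any]]:
    """Replay insert/update/delete events onto a replica keyed by primary key."""
    for change in changes:
        if change["op"] in ("insert", "update"):
            replica[change["key"]] = change["row"]
        elif change["op"] == "delete":
            replica.pop(change["key"], None)
        else:
            raise ValueError(f"unknown operation: {change['op']}")
    return replica


replica_table = apply_changes({}, events)
print(replica_table)  # {1: {'name': 'bolt', 'qty': 7}}
```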

Question # 190
You are selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You want to minimize service costs. You also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention. What should you do?

A. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
B. Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
C. Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.
D. Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.

Correct answer: A

Question # 191
You recently deployed several data processing jobs into your Cloud Composer 2 environment. You notice that some tasks are failing in Apache Airflow. On the monitoring dashboard, you see an increase in the total workers' memory usage, and there were worker pod evictions. You need to resolve these errors. What should you do?
Choose 2 answers

A. Increase the directed acyclic graph (DAG) file parsing interval.
B. Increase the memory available to the Airflow triggerer.
C. Increase the Cloud Composer 2 environment size from medium to large.
D. Increase the maximum number of workers and reduce worker concurrency.
E. Increase the memory available to the Airflow workers.

Correct answer: D, E

Explanation:
To resolve issues related to increased memory usage and worker pod evictions in your Cloud Composer 2 environment, the following steps are recommended:
Increase Memory Available to Airflow Workers:
By increasing the memory allocated to Airflow workers, you can handle more memory-intensive tasks, reducing the likelihood of pod evictions due to memory limits.
Increase Maximum Number of Workers and Reduce Worker Concurrency:
Increasing the number of workers allows the workload to be distributed across more pods, preventing any single pod from becoming overwhelmed.
Reducing worker concurrency limits the number of tasks that each worker can handle simultaneously, thereby lowering the memory consumption per worker.
Steps to Implement:
Increase Worker Memory:
Modify the configuration settings in Cloud Composer to allocate more memory to Airflow workers. This can be done through the environment configuration settings.
Adjust Worker and Concurrency Settings:
Increase the maximum number of workers in the Cloud Composer environment settings.
Reduce the concurrency setting for Airflow workers to ensure that each worker handles fewer tasks at a time, thus consuming less memory per worker.
Reference:
Cloud Composer Worker Configuration
Scaling Airflow Workers
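
The trade-off between worker memory, concurrency, and worker count can be sketched with simple arithmetic. The numbers below are made-up examples, and the model (memory shared evenly among concurrent tasks) is a simplifying assumption, not a description of how Airflow actually allocates memory.

```python
def memory_per_task_gb(worker_memory_gb: float, worker_concurrency: int) -> float:
    """Rough memory budget per task slot if a worker runs at full concurrency."""
    if worker_concurrency <= 0:
        raise ValueError("worker_concurrency must be positive")
    return worker_memory_gb / worker_concurrency


def total_task_slots(max_workers: int, worker_concurrency: int) -> int:
    """Total number of tasks the environment can run at once."""
    return max_workers * worker_concurrency


# Hypothetical before/after settings.
before = {"memory_gb": 4.0, "concurrency": 16, "max_workers": 3}
after = {"memory_gb": 8.0, "concurrency": 8, "max_workers": 6}

for label, cfg in (("before", before), ("after", after)):
    per_task = memory_per_task_gb(cfg["memory_gb"], cfg["concurrency"])
    slots = total_task_slots(cfg["max_workers"], cfg["concurrency"])
    print(f"{label}: {per_task:.2f} GB per task slot, {slots} total slots")
```

In this example the total number of task slots stays the same while the memory available to each slot increases, which is the effect options D and E aim for.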

Question # 192
Your startup has never implemented a formal security policy. Currently, everyone in the company has access to the datasets stored in Google BigQuery. Teams have freedom to use the service as they see fit, and they have not documented their use cases. You have been asked to secure the data warehouse. You need to discover what everyone is doing. What should you do first?

A. Get the identity and access management (IAM) policy of each table.
B. Use Google Stackdriver Audit Logs to review data access.
C. Use the Google Cloud Billing API to see what account the warehouse is being billed to.
D. Use Stackdriver Monitoring to see the usage of BigQuery query slots.

Correct answer: B

Explanation:
First we need to know who is accessing what; only then can we create suitable policies. Stackdriver Audit Logs record data access for BigQuery.

Question # 193
You work for a manufacturing company that sources up to 750 different components, each from a different supplier. You've collected a labeled dataset that has on average 1000 examples for each unique component.
Your team wants to implement an app to help warehouse workers recognize incoming components based on a photo of the component. You want to implement the first working version of this app (as Proof-Of-Concept) within a few working days. What should you do?

A. Use Cloud Vision AutoML, but reduce your dataset twice.
B. Train your own image recognition model leveraging transfer learning techniques.
C. Use Cloud Vision AutoML with the existing dataset.
D. Use Cloud Vision API by providing custom labels as recognition hints.

Correct answer: C

Question # 194
Which of the following is not possible using primitive roles?

A. Give GroupA owner access and GroupB editor access for all datasets in a project.
B. Give a user viewer access to BigQuery and owner access to Google Compute Engine instances.
C. Give UserA owner access and UserB editor access for all datasets in a project.
D. Give a user access to view all datasets in a project, but not run queries on them.

Correct answer: D

Explanation:
Primitive roles can be used to give owner, editor, or viewer access to a user or group, but they can't be used to separate data access permissions from job-running permissions.
Reference: https://cloud.google.com/bigquery/docs/access-control#primitive_iam_roles

Question # 195
You are designing a data processing pipeline. The pipeline must be able to scale automatically as load increases. Messages must be processed at least once and must be ordered within windows of 1 hour. How should you design the solution?

A. Use Apache Kafka for message ingestion and use Cloud Dataflow for streaming analysis.
B. Use Apache Kafka for message ingestion and use Cloud Dataproc for streaming analysis.
C. Use Cloud Pub/Sub for message ingestion and Cloud Dataproc for streaming analysis.
D. Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.

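Correct answer: D

Explanation:
Cloud Pub/Sub and Cloud Dataflow are both managed services that scale automatically as load increases. Pub/Sub provides at-least-once delivery, and Dataflow can group and order messages within one-hour windows based on their timestamps.

Question # 196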
Which of the following is NOT one of the three main types of triggers that Dataflow supports?

A. Trigger that is a combination of other triggers
B. Trigger based on element count
C. Trigger based on element size in bytes
D. Trigger based on time

Correct answer: C

Explanation:
There are three major kinds of triggers that Dataflow supports:
1. Time-based triggers.
2. Data-driven triggers. You can set a trigger to emit results from a window when that window has received a certain number of data elements.
3. Composite triggers. These triggers combine multiple time-based or data-driven triggers in some logical way.
Reference: https://cloud.google.com/dataflow/model/triggers
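
As a toy model of a data-driven trigger, the sketch below emits a pane each time a fixed number of elements has accumulated. It is plain Python written for illustration and does not use the Apache Beam or Dataflow API. The function name `element_count_trigger` is hypothetical.

```python
from typing import Iterable, Iterator


def element_count_trigger(elements: Iterable[int], count: int) -> Iterator[list[int]]:
    """Emit a pane every time `count` elements have accumulated.

    Whatever is left at the end is emitted as a final, smaller pane.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    pane: list[int] = []
    for element in elements:
        pane.append(element)
        if len(pane) == count:
            yield pane
            pane = []
    if pane:
        yield pane


panes = list(element_count_trigger(range(7), count=3))
print(panes)  # [[0, 1, 2], [3, 4, 5], [6]]
```

The trigger condition here counts elements. None of the three kinds listed above fires on the size of the elements in bytes, which is why option C is the odd one out.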

Question # 197
You want to store your team's shared tables in a single dataset to make data easily accessible to various analysts. You want to make this data readable but unmodifiable by analysts. At the same time, you want to provide the analysts with individual workspaces in the same project, where they can create and store tables for their own use, without the tables being accessible by other analysts. What should you do?

A. Give analysts the BigQuery Data Viewer role on the shared dataset. Create one other dataset and give the analysts the BigQuery Data Editor role on that dataset.
B. Give analysts the BigQuery Data Viewer role at the project level. Create a dataset for each analyst, and give each analyst the BigQuery Data Editor role at the project level.
C. Give analysts the BigQuery Data Viewer role on the shared dataset. Create a dataset for each analyst, and give each analyst the BigQuery Data Editor role at the dataset level for their assigned dataset.
D. Give analysts the BigQuery Data Viewer role at the project level. Create one other dataset, and give the analysts the BigQuery Data Editor role on that dataset.

Correct answer: C

Explanation:
The BigQuery Data Viewer role allows users to read data and metadata from tables and views, but not to modify or delete them. By giving analysts this role on the shared dataset, you can ensure that they can access the data for analysis, but not change it. The BigQuery Data Editor role allows users to create, update, and delete tables and views, as well as read and write data. By giving analysts this role at the dataset level for their assigned dataset, you can provide them with individual workspaces where they can store their own tables and views, without affecting the shared dataset or other analysts' datasets. This way, you can achieve both data protection and data isolation for your team. Reference:
BigQuery IAM roles and permissions
Basic roles and permissions

Question # 198
Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance. How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?

A. Use a row key of the form <timestamp>.
B. Use a row key of the form <sensorid>.
C. Use a row key of the form <sensorid>#<timestamp>.
D. Use a row key of the form <timestamp>#<sensorid>.

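Correct answer: C

Explanation:
A row key that begins with a timestamp sends all new writes to the same node (hotspotting). Prefixing the key with the sensor ID spreads writes across nodes while keeping each sensor's readings contiguous and in time order.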
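
Bigtable stores rows sorted lexicographically by row key, so the effect of a `<sensorid>#<timestamp>` key can be sketched with a sorted Python list standing in for the table. The helpers `make_row_key` and `prefix_scan` are hypothetical and do not use the Bigtable client library. The sensor IDs and timestamps are invented.

```python
from bisect import bisect_left


def make_row_key(sensor_id: str, timestamp: int) -> str:
    """Build a key of the form <sensorid>#<timestamp>.

    The timestamp is zero-padded so that lexicographic order matches
    chronological order.
    """
    return f"{sensor_id}#{timestamp:010d}"


def prefix_scan(sorted_keys: list[str], prefix: str) -> list[str]:
    """Return all keys starting with `prefix` from a sorted key list."""
    start = bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key)
    return result


readings = [("sensor-b", 1700000060), ("sensor-a", 1700000000),
            ("sensor-a", 1700000060), ("sensor-b", 1700000000)]
keys = sorted(make_row_key(s, t) for s, t in readings)

print(prefix_scan(keys, "sensor-a#"))
# ['sensor-a#1700000000', 'sensor-a#1700000060']
```

A dashboard query for one sensor becomes a contiguous range read, while writes from different sensors land in different parts of the key space.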

Question # 199
You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?

A. Train on the new data while using the existing data as your test set.
B. Train on the existing data while using the new data as your test set.
C. Continuously retrain the model on just the new data.
D. Continuously retrain the model on a combination of existing data and the new data.

Correct answer: D

Question # 200
You are designing storage for very large text files for a data pipeline on Google Cloud. You want to support ANSI SQL queries. You also want to support compression and parallel load from the input locations using Google recommended practices. What should you do?

A. Compress text files to gzip using the Grid Computing Tools. Use BigQuery for storage and query.
B. Transform text files to compressed Avro using Cloud Dataflow. Use BigQuery for storage and query.
C. Transform text files to compressed Avro using Cloud Dataflow. Use Cloud Storage and BigQuery permanent linked tables for query.
D. Compress text files to gzip using the Grid Computing Tools. Use Cloud Storage, and then import into Cloud Bigtable for query.

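Correct answer: B

Explanation:
Cloud Bigtable does not support ANSI SQL queries, and gzip-compressed text files cannot be read in parallel when loaded into BigQuery, whereas compressed Avro files can.

Question # 201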
You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity 'Movie' the property 'actors' and the property 'tags' have multiple values but the property 'date released' does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

A. Option C
B. Option D
C. Option A
D. Option B.

Correct answer: C

Question # 202
You're using Bigtable for a real-time application, and you have a heavy load that is a mix of read and writes.
You've recently identified an additional use case and need to perform hourly an analytical job to calculate certain statistics across the whole database. You need to ensure both the reliability of your production application as well as the analytical workload.
What should you do?

A. Increase the size of your existing cluster twice and execute your analytics workload on your new resized cluster.
B. Add a second cluster to an existing instance with a multi-cluster routing, use live-traffic app profile for your regular workload and batch-analytics profile for the analytics workload.
C. Export Bigtable dump to GCS and run your analytical job on top of the exported files.
D. Add a second cluster to an existing instance with a single-cluster routing, use live-traffic app profile for your regular workload and batch-analytics profile for the analytics workload.

Correct answer: D

Question # 203
Which of these is NOT a way to customize the software on Dataproc cluster instances?

A. Modify configuration files using cluster properties
B. Log into the master node and make changes from there
C. Configure the cluster using Cloud Deployment Manager
D. Set initialization actions

Correct answer: C

Explanation:
You can access the master node of the cluster by clicking the SSH button next to it in the Cloud Console.
You can easily use the --properties option of the dataproc command in the Google Cloud SDK to modify many common configuration files when creating a cluster.
When creating a Cloud Dataproc cluster, you can specify initialization actions in executables and/or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up. [https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions]

Question # 204
You work for a large ecommerce company. You store your customers' order data in Bigtable. You have a garbage collection policy set to delete the data after 30 days and the number of versions is set to 1. When the data analysts run a query to report total customer spending, the analysts sometimes see customer data that is older than 30 days. You need to ensure that the analysts do not see customer data older than 30 days while minimizing cost and overhead. What should you do?

A. Use a timestamp range filter in the query to fetch the customer's data for a specific range.
B. Set the expiring values of the column families to 29 days and keep the number of versions to 1.
C. Set the expiring values of the column families to 30 days and set the number of versions to 2.
D. Schedule a job daily to scan the data in the table and delete data older than 30 days.

Correct answer: A

Explanation:
By using a timestamp range filter in the query, you can ensure that the analysts only see the customer data that is within the desired time range, regardless of the garbage collection policy [1]. This option is the most cost-effective and simple way to avoid fetching data that is marked for deletion by garbage collection, as it does not require changing the existing policy or creating additional jobs. You can use the Bigtable client libraries or the cbt CLI to apply a timestamp range filter to your read requests [2].
Option C is not effective, as it increases the number of versions to 2, which may cause more data to be retained and increase the storage costs. Option B is not reliable, as it reduces the expiring values to 29 days, which may not match the actual data arrival and usage patterns. Option D is not efficient, as it requires scheduling a job daily to scan and delete the data, which may incur additional overhead and complexity. Moreover, none of these options guarantee that the data older than 30 days will be immediately deleted, as garbage collection is an asynchronous process that can take up to a week to remove the data [3]. References:
* 1: Filters | Cloud Bigtable Documentation | Google Cloud
* 2: Read data | Cloud Bigtable Documentation | Google Cloud
* 3: Garbage collection overview | Cloud Bigtable Documentation | Google Cloud
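
The effect of a timestamp range filter can be sketched in plain Python: expired cells that garbage collection has not yet removed are simply excluded at read time. This is an in-memory stand-in, not the Bigtable client library. The function `within_retention` and the sample cells are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def within_retention(cells, now, max_age_days=30):
    """Keep only cells whose timestamp is no older than now - max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return [cell for cell in cells if cell["timestamp"] >= cutoff]


now = datetime(2025, 3, 23, tzinfo=timezone.utc)
cells = [
    {"customer": "c1", "amount": 20.0, "timestamp": now - timedelta(days=5)},
    {"customer": "c1", "amount": 35.0, "timestamp": now - timedelta(days=29)},
    # Expired, but not yet removed by garbage collection.
    {"customer": "c1", "amount": 50.0, "timestamp": now - timedelta(days=33)},
]

visible = within_retention(cells, now)
print(sum(cell["amount"] for cell in visible))  # 55.0
```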

Question # 205
You are choosing a NoSQL database to handle telemetry data submitted from millions of Internet-of-Things (IoT) devices. The volume of data is growing at 100 TB per year, and each data entry has about 100 attributes. The data processing pipeline does not require atomicity, consistency, isolation, and durability (ACID). However, high availability and low latency are required.
You need to analyze the data by querying against individual fields. Which three databases meet your requirements? (Choose three.)

A. HBase
B. HDFS with Hive
C. Redis
D. MongoDB
E. MySQL
F. Cassandra

Correct answer: A, D, F

Question # 206
You have data stored in BigQuery. The data in the BigQuery dataset must be highly available. You need to define a storage, backup, and recovery strategy of this data that minimizes cost. How should you configure the BigQuery table?

A. Set the BigQuery dataset to be multi-regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
B. Set the BigQuery dataset to be regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.
C. Set the BigQuery dataset to be regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
D. Set the BigQuery dataset to be multi-regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.

Correct answer: B

Question # 207
You launched a new gaming app almost three years ago. You have been uploading log files from the previous day to a separate Google BigQuery table with the table name format LOGS_yyyymmdd. You have been using table wildcard functions to generate daily and monthly reports for all time ranges. Recently, you discovered that some queries that cover long date ranges are exceeding the limit of 1,000 tables and failing. How can you resolve this issue?

A. Create separate views to cover each month, and query from these views
B. Convert the sharded tables into a single partitioned table
C. Enable query caching so you can cache data from previous months
D. Convert all daily log tables into date-partitioned tables

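Correct answer: B

Explanation:
A single partitioned table holds every day as a partition of one table, so queries over long date ranges no longer reference more than 1,000 tables.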
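
A quick piece of arithmetic shows why the limit is reached after roughly three years of daily shards. The date range below is a hypothetical example, and the limit of 1,000 is the figure quoted in the question.

```python
from datetime import date, timedelta


def sharded_table_names(start: date, end: date) -> list[str]:
    """List LOGS_yyyymmdd table names for every day from start to end inclusive."""
    days = (end - start).days + 1
    return [f"LOGS_{(start + timedelta(days=i)):%Y%m%d}" for i in range(days)]


WILDCARD_TABLE_LIMIT = 1000  # limit quoted in the question

tables = sharded_table_names(date(2022, 1, 1), date(2024, 12, 31))
print(len(tables), tables[0], tables[-1])  # 1096 LOGS_20220101 LOGS_20241231
print("exceeds limit:", len(tables) > WILDCARD_TABLE_LIMIT)  # exceeds limit: True
```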

Question # 208
Your car factory is pushing machine measurements as messages into a Pub/Sub topic in your Google Cloud project. A Dataflow streaming job that you wrote with the Apache Beam SDK reads these messages, sends acknowledgment to Pub/Sub, applies some custom business logic in a DoFn instance, and writes the result to BigQuery. You want to ensure that if your business logic fails on a message, the message will be sent to a Pub/Sub topic that you want to monitor for alerting purposes. What should you do?

A. Create a snapshot of your Pub/Sub pull subscription. Use Cloud Monitoring to monitor the snapshot/num_messages metric on this snapshot.
B. Enable retaining of acknowledged messages in your Pub/Sub pull subscription. Use Cloud Monitoring to monitor the subscription/num_retained_acked_messages metric on this subscription.
C. Use an exception handling block in your Dataflow's DoFn code to push the messages that failed to be transformed through a side output and to a new Pub/Sub topic. Use Cloud Monitoring to monitor the topic/num_unacked_messages_by_region metric on this new topic.
D. Enable dead lettering in your Pub/Sub pull subscription, and specify a new Pub/Sub topic as the dead letter topic. Use Cloud Monitoring to monitor the subscription/dead_letter_message_count metric on your pull subscription.

Correct answer: D

Explanation:
To ensure that messages failing to process in your Dataflow job are sent to a Pub/Sub topic for monitoring and alerting, the best approach is to use Pub/Sub's dead-letter topic feature. Here's why option D is the best choice:
Dead-Letter Topic:
Pub/Sub's dead-letter topic feature allows messages that fail to be processed successfully to be redirected to a specified topic. This ensures that these messages are not lost and can be reviewed for debugging and alerting purposes.
Monitoring and Alerting:
By specifying a new Pub/Sub topic as the dead-letter topic, you can use Cloud Monitoring to track metrics such as subscription/dead_letter_message_count, providing visibility into the number of failed messages.
This allows you to set up alerts based on these metrics to notify the appropriate teams when failures occur.
Steps to Implement:
Enable Dead-Letter Topic:
Configure your Pub/Sub pull subscription to enable dead lettering and specify the new Pub/Sub topic for dead-letter messages.
Set Up Monitoring:
Use Cloud Monitoring to monitor the subscription/dead_letter_message_count metric on your pull subscription.
Configure alerts based on this metric to notify the team of any processing failures.
Reference:
Pub/Sub Dead Letter Policy
Cloud Monitoring with Pub/Sub
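
The dead-letter behavior described above can be sketched as a toy in-memory model: a message is redelivered until it is handled successfully, and after a maximum number of failed attempts it is forwarded to a separate destination. This is not the Pub/Sub client library. The function `deliver`, the handler `parse_measurement`, and the sample payloads are hypothetical. The parameter name `max_delivery_attempts` is chosen to echo the subscription setting.

```python
def deliver(messages, handler, max_delivery_attempts=5):
    """Toy model of a subscription with a dead-letter policy.

    A message is redelivered until `handler` succeeds. After
    `max_delivery_attempts` failed attempts it is forwarded to the
    dead-letter list instead.
    """
    processed, dead_letter = [], []
    for message in messages:
        for _attempt in range(max_delivery_attempts):
            try:
                processed.append(handler(message))
                break
            except ValueError:
                continue  # treated as a nack, so the message is redelivered
        else:
            dead_letter.append(message)
    return processed, dead_letter


def parse_measurement(message: str) -> float:
    """Hypothetical business logic: the payload must be a number."""
    return float(message)


ok, failed = deliver(["21.5", "not-a-number", "19.0"], parse_measurement)
print(ok)      # [21.5, 19.0]
print(failed)  # ['not-a-number']
```

Counting the entries in `failed` plays the role of the monitored metric in this sketch.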

Question # 209
You have several Spark jobs that run on a Cloud Dataproc cluster on a schedule. Some of the jobs run in sequence, and some of the jobs run concurrently. You need to automate this process. What should you do?

A. Create a Cloud Dataproc Workflow Template
B. Create an initialization action to execute the jobs
C. Create a Directed Acyclic Graph in Cloud Composer
D. Create a Bash script that uses the Cloud SDK to create a cluster, execute jobs, and then tear down the cluster

Correct answer: A

Explanation:
Reference: https://cloud.google.com/dataproc/docs/concepts/workflows/using-workflows

Question # 210
Your company receives both batch- and stream-based event data. You want to process the data using Google Cloud Dataflow over a predictable time period. However, you realize that in some instances data can arrive late or out of order. How should you design your Cloud Dataflow pipeline to handle data that is late or out of order?

A. Set a single global window to capture all the data.
B. Set sliding windows to capture all the lagged data.
C. Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.
D. Use watermarks and timestamps to capture the lagged data.

Correct answer: D

Explanation:
A watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived. If the watermark has progressed past the end of the window and new data arrives with a timestamp within that window, the data is considered late data.
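
A minimal sketch of this idea, assuming one-hour fixed windows and integer timestamps in seconds. It treats an element as late when the watermark has already passed the end of the element's window. The functions `window_start` and `classify` are hypothetical and do not use the Apache Beam API.

```python
WINDOW_SECONDS = 3600  # one-hour fixed windows (assumption for this sketch)


def window_start(event_time: int) -> int:
    """Start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def classify(event_time: int, watermark: int) -> str:
    """Label an element 'late' if the watermark has passed the end of its window."""
    window_end = window_start(event_time) + WINDOW_SECONDS
    return "late" if watermark >= window_end else "on time"


watermark = 7500  # the pipeline believes all data up to t=7500 has arrived
for event_time in (7300, 7000, 3500):
    print(event_time, window_start(event_time), classify(event_time, watermark))
# 7300 7200 on time
# 7000 3600 late
# 3500 0 late
```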

Question # 211
......

Real Professional-Data-Engineer exam questions and answers are free: https://www.goshiken.com/Google/Professional-Data-Engineer-mondaishu.html

Professional-Data-Engineer exam questions and real Professional-Data-Engineer practice question set: https://drive.google.com/open?id=1GRahuaVl9te4dDuuQlhzSCHfIVTK4hd7
