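The Latest 2024 Professional-Data-Engineer Practice Questions and Test Engine, with Questions That Actually Appear on the Exam
Updated Official Material: Certified Professional-Data-Engineer Question Set PDF
To pass the Google Professional-Data-Engineer exam, candidates need a solid understanding of data engineering concepts and techniques, as well as hands-on experience with the Google Cloud platform. They must be able to design and implement secure, scalable, and efficient data processing systems, and to troubleshoot and optimize those systems as needed. The exam is challenging and comprehensive, but it can open up many career opportunities in data engineering, especially for those interested in working with the Google Cloud platform.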
Question # 65
Which Java SDK class can you use to run your Dataflow programs locally?
- A. DirectPipelineRunner
- B. MachineRunner
- C. LocalPipelineRunner
- D. LocalRunner
Correct answer: A
Explanation:
DirectPipelineRunner allows you to execute operations in the pipeline directly, without any optimization. It is useful for small local executions and tests. Reference: https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/runners/DirectPipelineRunner
Question # 66
You are implementing workflow pipeline scheduling using open source-based tools and Google Kubernetes Engine (GKE). You want to use a Google managed service to simplify and automate the task. You also want to accommodate Shared VPC networking considerations. What should you do?
- A. Use Cloud Composer in a Shared VPC configuration. Place the Cloud Composer resources in the service project.
- B. Use Dataflow for your workflow pipelines. Use Cloud Run triggers for scheduling.
- C. Use Dataflow for your workflow pipelines. Use shell scripts to schedule workflows.
- D. Use Cloud Composer in a Shared VPC configuration. Place the Cloud Composer resources in the host project.
Correct answer: A
Explanation:
Shared VPC requires that you designate a host project to which networks and subnetworks belong and a service project, which is attached to the host project. When Cloud Composer participates in a Shared VPC, the Cloud Composer environment is in the service project. Reference:
https://cloud.google.com/composer/docs/how-to/managing/configuring-shared-vpc
Question # 67
You maintain ETL pipelines. You notice that a streaming pipeline running on Dataflow is taking a long time to process incoming data, which causes output delays. You also noticed that the pipeline graph was automatically optimized by Dataflow and merged into one step. You want to identify where the potential bottleneck is occurring. What should you do?
- A. Insert output sinks after each key processing step, and observe the writing throughput of each block.
- B. Verify that the Dataflow service accounts have appropriate permissions to write the processed data to the output sinks
- C. Log debug information in each ParDo function, and analyze the logs at execution time.
- D. Insert a Reshuffle operation after each processing step, and monitor the execution details in the Dataflow console.
Correct answer: D
Explanation:
A Reshuffle operation is a way to force Dataflow to split the pipeline into multiple stages, which can help isolate the performance of each step and identify bottlenecks. By monitoring the execution details in the Dataflow console, you can see the time, CPU, memory, and disk usage of each stage, as well as the number of elements and bytes processed. This can help you diagnose where the pipeline is slowing down and optimize it accordingly. References:
* 1: Reshuffling your data
* 2: Monitoring pipeline performance using the Dataflow monitoring interface
* 3: Optimizing pipeline performance
Question # 68
You need to create a new transaction table in Cloud Spanner that stores product sales data. You are deciding what to use as a primary key. From a performance perspective, which strategy should you choose?
- A. A random universally unique identifier number (version 4 UUID)
- B. The current epoch time
- C. A concatenation of the product name and the current epoch time
- D. The original order identification number from the sales system, which is a monotonically increasing integer
Correct answer: A
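Why a random key helps: Cloud Spanner divides data by primary-key ranges, so keys that only ever increase send every new row to the end of the key space, which can concentrate load in one place. The sketch below is a hedged illustration in plain Python, not Spanner code. The four equal slices of the key space and the helper name `bucket_counts` are assumptions made only to visualize how the two kinds of keys spread out.

```python
import uuid
from collections import Counter


def bucket_counts(keys, buckets=4):
    """Count how many keys fall into each of `buckets` equal slices of the
    128-bit key space (a stand-in for key-range splits; an assumption)."""
    span = 2 ** 128 // buckets
    return Counter(min(int(k) // span, buckets - 1) for k in keys)


# Monotonically increasing integers, like an order number or epoch time.
sequential_keys = list(range(1_000_000, 1_010_000))

# Version 4 UUIDs are mostly random bits, so keys scatter over the key space.
random_keys = [uuid.uuid4().int for _ in range(10_000)]

sequential_spread = bucket_counts(sequential_keys)
random_spread = bucket_counts(random_keys)

print("sequential:", dict(sequential_spread))  # all keys share one slice
print("uuid4:     ", dict(random_spread))      # keys spread over the slices
```

The sequential keys all land in a single slice, while the version 4 UUIDs divide roughly evenly, which is the property the correct answer relies on.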
Question # 69
You have an Apache Kafka cluster on-prem with topics containing web application logs. You need to replicate the data to Google Cloud for analysis in BigQuery and Cloud Storage. The preferred replication method is mirroring to avoid deployment of Kafka Connect plugins.
What should you do?
- A. Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
- B. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Sink connector. Use a Dataflow job to read from PubSub and write to GCS.
- C. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Source connector. Use a Dataflow job to read from PubSub and write to GCS.
- D. Deploy a Kafka cluster on GCE VM Instances with the PubSub Kafka connector configured as a Sink connector. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
Correct answer: A
Question # 70
You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud. You want to support transactions that scale horizontally. You also want to optimize data for range queries on non-key columns. What should you do?
- A. Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.
- B. Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query patterns.
- C. Use Cloud SQL for storage. Add secondary indexes to support query patterns.
- D. Use Cloud Spanner for storage. Add secondary indexes to support query patterns.
Correct answer: D
Explanation:
Cloud Spanner supports transactions that scale horizontally, and its secondary indexes support range queries on non-key columns.
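As a rough illustration of why a secondary index helps range queries on a non-key column, the sketch below uses SQLite from the Python standard library as a stand-in. It is not Cloud Spanner, and the table `sales`, the column `amount`, and the index `idx_sales_amount` are hypothetical names. The query plan is inspected before and after the index is created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(i, i % 500) for i in range(5_000)],
)

# Range predicate on a column that is not part of the primary key.
range_query = "SELECT sale_id FROM sales WHERE amount BETWEEN 100 AND 120"


def plan(sql):
    """Return SQLite's query-plan detail strings for `sql`."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_without_index = plan(range_query)

# Secondary index on the non-key column used in the range predicate.
conn.execute("CREATE INDEX idx_sales_amount ON sales (amount)")
plan_with_index = plan(range_query)

matching_rows = conn.execute(range_query).fetchall()

print(plan_without_index)
print(plan_with_index)
print(len(matching_rows), "rows matched")
```

Before the index exists the plan has to walk the whole table. Afterward it can search the index for the requested range, which is the same idea behind adding secondary indexes in option D.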
Question # 71
You want to archive data in Cloud Storage. Because some data is very sensitive, you want to use the "Trust No One" (TNO) approach to encrypt your data to prevent the cloud provider staff from decrypting your data.
What should you do?
- A. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key and unique additional authenticated data (AAD). Use gsutil cp to upload each encrypted file to the Cloud Storage bucket, and keep the AAD outside of Google Cloud.
- B. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in Cloud Memorystore as permanent storage of the secret.
- C. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in a different project that only the security team can access.
- D. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key. Use gsutil cp to upload each encrypted file to the Cloud Storage bucket. Manually destroy the key previously used for encryption, and rotate the key once.
Correct answer: D
Question # 72
You are developing an application that uses a recommendation engine on Google Cloud. Your solution should display new videos to customers based on past views. Your solution needs to generate labels for the entities in videos that the customer has viewed. Your design must be able to provide very fast filtering suggestions based on data from other customer preferences on several TB of data. What should you do?
- A. Build an application that calls the Cloud Video Intelligence API to generate labels. Store data in Cloud SQL, and join and filter the predicted labels to match the user's viewing history to generate preferences.
- B. Build and train a complex classification model with Spark MLlib to generate labels and filter the results. Deploy the models using Cloud Dataproc. Call the model from your application.
- C. Build and train a classification model with Spark MLlib to generate labels. Build and train a second classification model with Spark MLlib to filter results to match customer preferences. Deploy the models using Cloud Dataproc. Call the models from your application.
- D. Build an application that calls the Cloud Video Intelligence API to generate labels. Store data in Cloud Bigtable, and filter the predicted labels to match the user's viewing history to generate preferences.
Correct answer: D
Question # 73
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments (development/test, staging, and production) to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers.
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data.
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day.
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
You create a new report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. It is company policy to ensure employees can view only the data associated with their region, so you create and populate a table for each region. You need to enforce the regional access policy to the data.
Which two actions should you take? (Choose two.)
- A. Adjust the settings for each view to allow a related region-based security group view access.
- B. Ensure all the tables are included in a global dataset.
- C. Ensure each table is included in a dataset for a region.
- D. Adjust the settings for each table to allow a related region-based security group view access.
- E. Adjust the settings for each dataset to allow a related region-based security group view access.
Correct answer: A, C
Question # 74
Your United States-based company has created an application for assessing and responding to user actions.
The primary table's data volume grows by 250,000 records per second. Many third parties use your application's APIs to build the functionality into their own frontend applications. Your application's APIs should comply with the following requirements:
* Single global endpoint
* ANSI SQL support
* Consistent access to the most up-to-date data
What should you do?
- A. Implement Cloud SQL for PostgreSQL with the master in North America and read replicas in Asia and Europe.
- B. Implement Cloud Bigtable with the primary cluster in North America and secondary clusters in Asia and Europe.
- C. Implement Cloud Spanner with the leader in North America and read-only replicas in Asia and Europe.
- D. Implement BigQuery with no region selected for storage or processing.
Correct answer: C
Question # 75
Which SQL keyword can be used to reduce the number of columns processed by BigQuery?
- A. SELECT
- B. LIMIT
- C. BETWEEN
- D. WHERE
Correct answer: A
Explanation:
SELECT allows you to query specific columns rather than the whole table.
LIMIT, BETWEEN, and WHERE clauses will not reduce the number of columns processed by BigQuery.
Reference:
https://cloud.google.com/bigquery/launch-checklist#architecture_design_and_development_checklist
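To make the effect concrete, here is a simplified, hypothetical model of columnar storage: the bytes read are estimated as the combined size of only the columns a query references. The table layout, the per-column sizes, and the helper `estimated_bytes` are illustrative assumptions, not an exact reproduction of how BigQuery accounts for processed data.

```python
# Assumed average bytes per value for each column of a hypothetical table.
COLUMN_BYTES = {
    "user_id": 8,        # INT64
    "event_time": 8,     # TIMESTAMP
    "page_url": 60,      # STRING, assumed average size
    "user_agent": 120,   # STRING, assumed average size
}
ROW_COUNT = 100_000_000


def estimated_bytes(selected_columns, row_count=ROW_COUNT):
    """Estimate bytes read when only `selected_columns` are scanned."""
    return sum(COLUMN_BYTES[name] for name in selected_columns) * row_count


# Equivalent of SELECT * (every column is read).
select_star = estimated_bytes(COLUMN_BYTES)

# Equivalent of SELECT user_id, event_time (two columns are read).
select_two = estimated_bytes(["user_id", "event_time"])

print(f"SELECT *          : {select_star / 1e9:.1f} GB")
print(f"SELECT two columns: {select_two / 1e9:.1f} GB")
```

In this model, adding a LIMIT or WHERE clause changes neither figure, because the referenced columns are still read in full. Only narrowing the SELECT list does.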
Question # 76
You are operating a Cloud Dataflow streaming pipeline. The pipeline aggregates events from a Cloud Pub/Sub subscription source, within a window, and sinks the resulting aggregation to a Cloud Storage bucket. The source has consistent throughput. You want to monitor and alert on the behavior of the pipeline with Cloud Stackdriver to ensure that it is processing data. Which Stackdriver alerts should you create?
- A. An alert based on an increase of instance/storage/used_bytes for the source and a rate of change decrease of subscription/num_undelivered_messages for the destination
- B. An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/used_bytes for the destination
- C. An alert based on a decrease of subscription/num_undelivered_messages for the source and a rate of change increase of instance/storage/used_bytes for the destination
- D. An alert based on a decrease of instance/storage/used_bytes for the source and a rate of change increase of subscription/num_undelivered_messages for the destination
Correct answer: B
Question # 77
You want to archive data in Cloud Storage. Because some data is very sensitive, you want to use the "Trust No One" (TNO) approach to encrypt your data to prevent the cloud provider staff from decrypting your data.
What should you do?
- A. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key and unique additional authenticated data (AAD). Use gsutil cp to upload each encrypted file to the Cloud Storage bucket, and keep the AAD outside of Google Cloud.
- B. Use gcloud kms keys create to create a symmetric key. Then use gcloud kms encrypt to encrypt each archival file with the key. Use gsutil cp to upload each encrypted file to the Cloud Storage bucket. Manually destroy the key previously used for encryption, and rotate the key once.
- C. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in Cloud Memorystore as permanent storage of the secret.
- D. Specify customer-supplied encryption key (CSEK) in the .boto configuration file. Use gsutil cp to upload each archival file to the Cloud Storage bucket. Save the CSEK in a different project that only the security team can access.
Correct answer: B
Question # 78
Which of the following job types are supported by Cloud Dataproc (select 3 answers)?
- A. Hive
- B. Spark
- C. Pig
- D. YARN
Correct answer: A, B, C
Explanation:
Cloud Dataproc provides out-of-the-box and end-to-end support for many of the most popular job types, including Spark, Spark SQL, PySpark, MapReduce, Hive, and Pig jobs.
Reference:
https://cloud.google.com/dataproc/docs/resources/faq#what_type_of_jobs_can_i_run
Question # 79
You need to choose a database to store time series CPU and memory usage for millions of computers. You need to store this data in one-second interval samples. Analysts will be performing real-time, ad hoc analytics against the database. You want to avoid being charged for every query executed and ensure that the schema design will allow for future growth of the dataset. Which database and data model should you choose?
- A. Create a narrow table in Cloud Bigtable with a row key that combines the Compute Engine computer identifier with the sample time at each second
- B. Create a wide table in Cloud Bigtable with a row key that combines the computer identifier with the sample time at each minute, and combine the values for each second as column data.
- C. Create a table in BigQuery, and append the new samples for CPU and memory to the table
- D. Create a wide table in BigQuery, create a column for the sample value at each second, and update the row with the interval for each second
Correct answer: A
Explanation:
A tall and narrow table has a small number of events per row, which could be just one event, whereas a short and wide table has a large number of events per row. For time series, you should generally use tall and narrow tables, for two reasons: storing one event per row makes it easier to run queries against your data, and storing many events per row makes it more likely that the total row size will exceed the recommended maximum. Reference: https://cloud.google.com/bigtable/docs/schema-design-time-series#patterns_for_row_key_design
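A minimal sketch of the tall-and-narrow pattern, using plain Python in place of a Bigtable client: each one-second sample becomes its own row, and the row key combines the computer identifier with the sample time. The `machine_id#timestamp` layout, the `#` separator, and the helper `row_key` are assumptions chosen for illustration.

```python
from datetime import datetime, timezone


def row_key(machine_id: str, sample_time: datetime) -> bytes:
    """Build a row key of the form b'<machine_id>#<UTC second>'.

    The fixed-width timestamp is an assumption chosen so that keys for
    one machine sort chronologically when compared as bytes.
    """
    stamp = sample_time.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{machine_id}#{stamp}".encode("utf-8")


samples = [
    ("vm-0042", datetime(2024, 1, 1, 0, 0, 2, tzinfo=timezone.utc), 0.71),
    ("vm-0042", datetime(2024, 1, 1, 0, 0, 1, tzinfo=timezone.utc), 0.64),
    ("vm-0007", datetime(2024, 1, 1, 0, 0, 1, tzinfo=timezone.utc), 0.12),
]

# One row per sample: a narrow row holding a single CPU reading.
rows = {row_key(machine, when): {"cpu": cpu} for machine, when, cpu in samples}

for key in sorted(rows):
    print(key.decode(), rows[key])
```

Because every sample is its own row, new seconds only add rows and never widen existing ones, which matches the growth concern raised in the question.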
Question # 80
You have spent a few days loading data from comma-separated values (CSV) files into the Google BigQuery table CLICK_STREAM. The column DT stores the epoch time of click events. For convenience, you chose a simple schema where every field is treated as the STRING type. Now, you want to compute web session durations of users who visit your site, and you want to change its data type to the TIMESTAMP. You want to minimize the migration effort without making future queries computationally expensive. What should you do?
- A. Construct a query to return every row of the table CLICK_STREAM, while using the built-in function to cast strings from the column DT into TIMESTAMP values. Run the query into a destination table NEW_CLICK_STREAM, in which the column TS is the TIMESTAMP type. Reference the table NEW_CLICK_STREAM instead of the table CLICK_STREAM from now on. In the future, new data is loaded into the table NEW_CLICK_STREAM.
- B. Delete the table CLICK_STREAM, and then re-create it such that the column DT is of the TIMESTAMP type. Reload the data.
- C. Create a view CLICK_STREAM_V, where strings from the column DT are cast into TIMESTAMP values. Reference the view CLICK_STREAM_V instead of the table CLICK_STREAM from now on.
- D. Add two columns to the table CLICK_STREAM: TS of the TIMESTAMP type and IS_NEW of the BOOLEAN type. Reload all data in append mode. For each appended row, set the value of IS_NEW to true. For future queries, reference the column TS instead of the column DT, with the WHERE clause ensuring that the value of IS_NEW must be true.
- E. Add a column TS of the TIMESTAMP type to the table CLICK_STREAM, and populate the numeric values from the column TS for each row. Reference the column TS instead of the column DT from now on.
Correct answer: D
Question # 81
You need to move 2 PB of historical data from an on-premises storage appliance to Cloud Storage within six months, and your outbound network capacity is constrained to 20 Mb/sec. How should you migrate this data to Cloud Storage?
- A. Use Transfer Appliance to copy the data to Cloud Storage
- B. Use gsutil cp -J to compress the content being uploaded to Cloud Storage
- C. Use trickle or ionice along with gsutil cp to limit the amount of bandwidth gsutil utilizes to less than 20 Mb/sec so it does not interfere with the production traffic
- D. Create a private URL for the historical data, and then use Storage Transfer Service to copy the data to Cloud Storage
Correct answer: A
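A quick back-of-the-envelope check shows why a network transfer cannot meet the six-month deadline. The calculation below assumes decimal units (1 PB = 10^15 bytes, 1 Mb = 10^6 bits) and a link that runs at the full 20 Mb/sec without interruption, which is optimistic.

```python
PETABYTE = 10 ** 15                    # bytes, decimal definition (assumed)
data_bits = 2 * PETABYTE * 8           # 2 PB expressed in bits
link_bits_per_second = 20 * 10 ** 6    # 20 Mb/sec

seconds = data_bits / link_bits_per_second
days = seconds / 86_400
years = days / 365

print(f"{days:,.0f} days (about {years:.1f} years)")
```

Under these assumptions the upload would take roughly 25 years, so compressing or throttling the transfer does not help and a physical Transfer Appliance is the practical option.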
Question # 82
Which of the following is NOT one of the three main types of triggers that Dataflow supports?
- A. Trigger based on time
- B. Trigger based on element count
- C. Trigger that is a combination of other triggers
- D. Trigger based on element size in bytes
Correct answer: D
Explanation:
There are three major kinds of triggers that Dataflow supports:
1. Time-based triggers.
2. Data-driven triggers. You can set a trigger to emit results from a window when that window has received a certain number of data elements.
3. Composite triggers. These triggers combine multiple time-based or data-driven triggers in some logical way.
Reference: https://cloud.google.com/dataflow/model/triggers
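The toy sketch below imitates, in plain Python, how a data-driven (element-count) trigger behaves within a single window: a pane of results is emitted each time a set number of elements has arrived. It is not Dataflow or Apache Beam code, and `count_trigger` and its `fire_after` parameter are hypothetical names.

```python
def count_trigger(elements, fire_after):
    """Yield a pane (list) every time `fire_after` elements have arrived.

    Any elements left over when the input ends are emitted as a final,
    smaller pane, loosely mimicking a window that closes.
    """
    pane = []
    for element in elements:
        pane.append(element)
        if len(pane) == fire_after:
            yield pane
            pane = []
    if pane:
        yield pane


# Seven elements arrive in one window; the trigger fires every three.
panes = list(count_trigger(range(1, 8), fire_after=3))
print(panes)
```

Note that the firing condition here depends on how many elements arrived, not on how many bytes they occupy, which is why option D is not one of the supported trigger types.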
Question # 83
You need to create a new transaction table in Cloud Spanner that stores product sales data. You are deciding what to use as a primary key. From a performance perspective, which strategy should you choose?
- A. A random universally unique identifier number (version 4 UUID)
- B. The current epoch time
- C. A concatenation of the product name and the current epoch time
- D. The original order identification number from the sales system, which is a monotonically increasing integer
Correct answer: A
Explanation:
Reference: https://www.uuidgenerator.net/version4
Question # 84
......
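The exam consists of multiple-choice and multiple-select questions and lasts 2 hours. Candidates must answer a total of 50 questions, and the passing score is 70%. The exam is available in English and Japanese and can be taken online or at a test center. The exam fee is $200, and the certification is valid for two years.
The Google Professional-Data-Engineer exam is a certification offered by Google that validates the skills and knowledge needed to design, build, and manage data processing systems on Google Cloud Platform. The certification is aimed at data professionals with expertise in designing and managing data solutions on Google Cloud. The exam covers a range of topics, including designing data processing systems, implementing data storage solutions, managing data processing infrastructure, and data security and compliance.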
Try the latest version for free. The Google Professional-Data-Engineer question set PDF has been updated: https://www.goshiken.com/Google/Professional-Data-Engineer-mondaishu.html
The newly released Professional-Data-Engineer question set, Google Cloud Certified: https://drive.google.com/open?id=1g-DHDPn495WQ7m3b-aJA9JlQiCLOwCsl