
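Latest 2024 GoShiken Google Professional-Data-Engineer Question Set and Exam Test Engine
The Google Professional-Data-Engineer question set includes real exam questions and answers
The Google Certified Professional Data Engineer exam is a certification exam offered by Google for individuals who have the knowledge to design and build data processing systems on Google Cloud Platform. The exam is designed to test candidates' knowledge of data processing systems, machine learning, and data analysis tools on Google Cloud Platform.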
Question # 131
You need to migrate a 2TB relational database to Google Cloud Platform. You do not have the resources to significantly refactor the application that uses this database and cost to operate is of primary concern.
Which service do you select for storing and serving your data?
- A. Cloud Bigtable
- B. Cloud Firestore
- C. Cloud SQL
- D. Cloud Spanner
Correct answer: C
Question # 132
When creating a new Cloud Dataproc cluster with the projects.regions.clusters.create operation, these four values are required: project, region, name, and ____.
- A. label
- B. zone
- C. node
- D. type
Correct answer: B
Explanation:
At a minimum, you must specify four values when creating a new cluster with the projects.regions.clusters.create operation:
* The project in which the cluster will be created
* The region to use
* The name of the cluster
* The zone in which the cluster will be created
You can specify many more details beyond these minimum requirements. For example, you can also specify the number of workers, whether preemptible compute should be used, and the network settings.
Reference:
https://cloud.google.com/dataproc/docs/tutorials/python-library-example#create_a_new_cloud_dataproc_cluste
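The four required values can be pictured as a minimal request. The sketch below is plain Python that only builds and checks a dictionary; it does not call the Dataproc API. The helper name `build_cluster_request` and the sample values are hypothetical, and the field names (`projectId`, `clusterName`, `config.gceClusterConfig.zoneUri`) are assumed from the Dataproc REST `Cluster` resource.

```python
def build_cluster_request(project: str, region: str, name: str, zone: str) -> dict:
    """Build a minimal clusters.create request (hypothetical helper)."""
    required = {"project": project, "region": region, "name": name, "zone": zone}
    missing = [key for key, value in required.items() if not value]
    if missing:
        raise ValueError(f"missing required values: {', '.join(missing)}")
    return {
        # Path parameters of projects.regions.clusters.create
        "projectId": project,
        "region": region,
        # Request body: the Cluster resource (field names are an assumption)
        "body": {
            "projectId": project,
            "clusterName": name,
            "config": {"gceClusterConfig": {"zoneUri": zone}},
        },
    }


request = build_cluster_request("my-project", "us-central1", "demo-cluster", "us-central1-b")
print(request["body"]["config"]["gceClusterConfig"]["zoneUri"])
```

Leaving any of the four values empty raises a `ValueError`, which mirrors the point of the question: the zone is as mandatory as the project, region, and name.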
Question # 133
Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow.
Numerous data logs are being generated during this step, and the team wants to analyze them.
Due to the dynamic nature of the campaign, the data is growing exponentially every hour.
The data scientists have written the following code to read the data for new key features in the logs.
BigQueryIO.Read
.named("ReadLogData")
.from("clouddataflow-readonly:samples.log_data")
You want to improve the performance of this data read. What should you do?
- A. Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.
- B. Specify the TableReference object in the code.
- C. Use both the Google BigQuery TableSchema and TableFieldSchema classes.
- D. Use the .fromQuery operation to read specific fields from the table.
Correct answer: A
Question # 134
You are building a new application that you need to collect data from in a scalable way. Data arrives continuously from the application throughout the day, and you expect to generate approximately 150 GB of JSON data per day by the end of the year. Your requirements are:
* Decoupling producer from consumer
* Space and cost-efficient storage of the raw ingested data, which is to be stored indefinitely
* Near real-time SQL query
* Maintain at least 2 years of historical data, which will be queried with SQL
Which pipeline should you use to meet these requirements?
- A. Create an application that provides an API. Write a tool to poll the API and write data to Cloud Storage as gzipped JSON files.
- B. Create an application that publishes events to Cloud Pub/Sub, and create a Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Cloud Storage and BigQuery.
- C. Create an application that publishes events to Cloud Pub/Sub, and create Spark jobs on Cloud Dataproc to convert the JSON data to Avro format, stored on HDFS on Persistent Disk.
- D. Create an application that writes to a Cloud SQL database to store the data. Set up periodic exports of the database to write to Cloud Storage and load into BigQuery.
Correct answer: A
Question # 135
You have created an external table for Apache Hive partitioned data that resides in a Cloud Storage bucket, which contains a large number of files. You notice that queries against this table are slow. You want to improve the performance of these queries. What should you do?
- A. Migrate the Hive partitioned data objects to a multi-region Cloud Storage bucket.
- B. Create an individual external table for each Hive partition by using a common table name prefix. Use wildcard table queries to reference the partitioned data.
- C. Upgrade the external table to a BigLake table. Enable metadata caching for the table.
- D. Change the storage class of the Hive partitioned data objects from Coldline to Standard.
Correct answer: C
Explanation:
BigLake tables let you query structured data in external stores such as Cloud Storage, Amazon S3, and Azure Blob Storage with access delegation and governance. They extend BigQuery to data lakes and support a flexible, open lakehouse architecture. Upgrading the external table to a BigLake table makes metadata caching available: BigQuery caches file names, partition information, and file-level metadata such as row counts, so a query no longer has to list a large number of objects in Cloud Storage before it can prune partitions and files. That listing step is what slows queries down when the bucket holds many files. Metadata caching is controlled by two table options, max_staleness and metadata_cache_mode. The upgrade itself is done by updating or re-creating the table definition with a Cloud resource connection.
For example, assuming Parquet files and hypothetical dataset, connection, and bucket names:
SQL
CREATE OR REPLACE EXTERNAL TABLE mydataset.hive_partitioned_data
WITH PARTITION COLUMNS
WITH CONNECTION `myproject.us.my-connection`
OPTIONS (
  format = 'PARQUET', uris = ['gs://my-bucket/data/*'],
  hive_partition_uri_prefix = 'gs://my-bucket/data',
  max_staleness = INTERVAL 4 HOUR, metadata_cache_mode = 'AUTOMATIC'
);
References:
* Introduction to BigLake tables
* Upgrade an external table to BigLake
* BigQuery storage API
Question # 136
Which of these is not a supported method of putting data into a partitioned table?
- A. Use ORDER BY to put a table's rows into chronological order and then change the table's type to "Partitioned".
- B. Run a query to get the records for a specific day from an existing table and for the destination table, specify a partitioned table ending with the day in the format "$YYYYMMDD".
- C. If you have existing data in a separate file for each day, then create a partitioned table and upload each file into the appropriate partition.
- D. Create a partitioned table and stream new records to it every day.
Correct answer: A
Explanation:
You cannot change an existing table into a partitioned table. You must create a partitioned table from scratch. Then you can either stream data into it every day and the data will automatically be put in the right partition, or you can load data into a specific partition by using "$YYYYMMDD" at the end of the table name.
Reference: https://cloud.google.com/bigquery/docs/partitioned-tables
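The "$YYYYMMDD" suffix described above is just the table name followed by a dollar sign and the partition date. A minimal sketch of building it, where the helper name `partition_decorator` and the table name `mydataset.events` are hypothetical:

```python
from datetime import date


def partition_decorator(table: str, day: date) -> str:
    """Return the table name addressed to a single daily partition."""
    return f"{table}${day.strftime('%Y%m%d')}"


print(partition_decorator("mydataset.events", date(2024, 1, 31)))
# prints mydataset.events$20240131
```

When such a name is passed on a shell command line, the dollar sign usually needs quoting so the shell does not treat it as a variable.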
Question # 137
Your company built a TensorFlow neural-network model with a large number of neurons and layers. The model fits the training data well. However, when tested against new data, it performs poorly. What method can you employ to address this?
- A. Serialization
- B. Threading
- C. Dimensionality Reduction
- D. Dropout Methods
Correct answer: D
Explanation:
Reference: https://medium.com/mlreview/a-simple-deep-learning-model-for-stock-price-prediction-using-tensorflow-30505541d877
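Dropout reduces overfitting by randomly disabling units during training so the network cannot rely on any single neuron. The sketch below is a plain-Python illustration of "inverted" dropout on a list of activations. It is not TensorFlow's implementation, and the function name and sample values are hypothetical.

```python
import random
from typing import List


def dropout(activations: List[float], rate: float, rng: random.Random) -> List[float]:
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # Surviving units are divided by `keep` so the expected sum is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng))
```

Dropout is applied only during training; at prediction time all units are kept, which the rescaling above makes consistent.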
Question # 138
MJTelco is building a custom interface to share data. They have these requirements:
* They need to do aggregations over their petabyte-scale datasets.
* They need to scan specific time range rows with a very fast response time (milliseconds).
Which combination of Google Cloud Platform products should you recommend?
- A. Cloud Datastore and Cloud Bigtable
- B. BigQuery and Cloud Storage
- C. BigQuery and Cloud Bigtable
- D. Cloud Bigtable and Cloud SQL
Correct answer: C
Question # 139
You are developing a new deep learning model that predicts a customer's likelihood to buy on your ecommerce site. After running an evaluation of the model against both the original training data and new test data, you find that your model is overfitting the data. You want to improve the accuracy of the model when predicting new data. What should you do?
- A. Reduce the size of the training dataset, and decrease the number of input features.
- B. Increase the size of the training dataset, and increase the number of input features.
- C. Increase the size of the training dataset, and decrease the number of input features.
- D. Reduce the size of the training dataset, and increase the number of input features.
Correct answer: C
Explanation:
https://machinelearningmastery.com/impact-of-dataset-size-on-deep-learning-model-skill-and-performance-estim
Question # 140
You are migrating your data warehouse to BigQuery. You have migrated all of your data into tables in a dataset. Multiple users from your organization will be using the data. They should only see certain tables based on their team membership. How should you set user permissions?
- A. Assign the users/groups data viewer access at the table level for each table
- B. Create authorized views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the authorized views
- C. Create SQL views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the SQL views
- D. Create authorized views for each team in datasets created for each team. Assign the authorized views data viewer access to the dataset in which the data resides. Assign the users/groups data viewer access to the datasets in which the authorized views reside
Correct answer: A
Question # 141
Your financial services company is moving to cloud technology and wants to store 50 TB of financial time-series data in the cloud. This data is updated frequently and new data will be streaming in all the time. Your company also wants to move their existing Apache Hadoop jobs to the cloud to get insights into this data. Which product should they use to store the data?
- A. Google Cloud Datastore
- B. Google Cloud Storage
- C. Google BigQuery
- D. Cloud Bigtable
Correct answer: D
Explanation:
Reference: https://cloud.google.com/bigtable/docs/schema-design-time-series
Question # 142
Which of the following statements about Legacy SQL and Standard SQL is not true?
- A. If you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
- B. You need to set a query language for each dataset and the default is Standard SQL.
- C. Standard SQL is the preferred query language for BigQuery.
- D. One difference between the two query languages is how you specify fully-qualified table names (i.e., table names that include their associated project name).
Correct answer: B
Explanation:
You do not set a query language for each dataset. It is set each time you run a query and the default query language is Legacy SQL.
Standard SQL has been the preferred query language since BigQuery 2.0 was released. In legacy SQL, to query a table with a project-qualified name, you use a colon, :, as a separator. In standard SQL, you use a period, ., instead.
Due to the differences in syntax between the two query languages (such as with project-qualified table names), if you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
Reference:
https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
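The separator difference described above can be shown with a small rewrite. This is a sketch that handles only bracketed, project-qualified table references. The helper name `legacy_to_standard` is hypothetical, and other syntax differences between the two dialects are not addressed.

```python
import re

# Matches [project:dataset.table] as written in legacy SQL.
LEGACY_TABLE = re.compile(r"\[([^:\]]+):([^.\]]+)\.([^\]]+)\]")


def legacy_to_standard(query: str) -> str:
    """Rewrite [project:dataset.table] references as `project.dataset.table`."""
    return LEGACY_TABLE.sub(r"`\1.\2.\3`", query)


legacy = "SELECT word FROM [bigquery-public-data:samples.shakespeare] LIMIT 5"
print(legacy_to_standard(legacy))
# prints SELECT word FROM `bigquery-public-data.samples.shakespeare` LIMIT 5
```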
Question # 143
You have several Spark jobs that run on a Cloud Dataproc cluster on a schedule. Some of the jobs run in sequence, and some of the jobs run concurrently. You need to automate this process. What should you do?
- A. Create a Bash script that uses the Cloud SDK to create a cluster, execute jobs, and then tear down the cluster
- B. Create a Cloud Dataproc Workflow Template
- C. Create an initialization action to execute the jobs
- D. Create a Directed Acyclic Graph in Cloud Composer
Correct answer: B
Explanation:
https://cloud.google.com/dataproc/docs/concepts/workflows/using-workflows
Question # 144
Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?
- A. Hashing
- B. Randomization
- C. Salting
- D. Field promotion
Correct answer: D
Explanation:
By default, prefer field promotion. Field promotion avoids hotspotting in almost all cases, and it tends to make it easier to design a row key that facilitates queries.
Reference: https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotting
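Field promotion means moving a field from the column data into the row key, so that writes are spread across key ranges instead of piling onto the newest timestamp. A minimal sketch, assuming a device identifier and metric name are the promoted fields; the helper name, the `#` delimiter, and the sample values are illustrative choices, not a prescribed format.

```python
from datetime import datetime, timezone


def promoted_row_key(device_id: str, metric: str, ts: datetime) -> str:
    """Row key with promoted fields first and the timestamp last."""
    stamp = ts.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{device_id}#{metric}#{stamp}"


key = promoted_row_key("device-0042", "cpu", datetime(2024, 1, 31, 12, 0, 0, tzinfo=timezone.utc))
print(key)  # prints device-0042#cpu#20240131120000
```

Because the key starts with the device rather than the time, rows written at the same moment by different devices land in different parts of the key space, and a scan for one device over a time range is still a contiguous range read.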
Question # 145
Case Study: 2 - MJTelco
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network, allowing them to account for the impact of dynamic regional politics on location availability and cost. Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers.
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data.
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day.
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day's events. They also want to use streaming ingestion. What should you do?
- A. Create a table called tracking_table and include a DATE column.
- B. Create a table called tracking_table with a TIMESTAMP column to represent the day.
- C. Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.
- D. Create a partitioned table called tracking_table and include a TIMESTAMP column.
Correct answer: D
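A back-of-the-envelope check shows why a partitioned table keeps daily queries cheap. The figures come from the case study (about 100 million records per day, up to 2 years retained). The 365-day year, evenly sized days, and a query that filters on the partitioning column are assumptions of this sketch.

```python
RECORDS_PER_DAY = 100_000_000   # "approximately 100m records/day" from the case study
RETENTION_DAYS = 2 * 365        # "up to 2 years of data" (assumes 365-day years)

total_records = RECORDS_PER_DAY * RETENTION_DAYS
# A query filtered to one day's partition reads only that partition.
fraction_scanned = RECORDS_PER_DAY / total_records

print(f"records retained: {total_records:,}")
print(f"share scanned by a one-day query: {fraction_scanned:.4%}")
```

Under these assumptions a one-day query touches roughly one part in 730 of the table, whereas an unpartitioned table with only a TIMESTAMP or DATE column would be scanned in full.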
Question # 146
You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the initial data are Parquet files (on average 200-400 MB each). You see some degradation in performance after the migration to Dataproc, so you'd like to optimize for it. You need to keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with only 2 non-preemptible workers) for this workload.
What should you do?
- A. Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job and copy results back to GCS.
- B. Switch from HDDs to SSDs, override the preemptible VMs configuration to increase the boot disk size.
- C. Switch to TFRecords formats (appr. 200MB per file) instead of parquet files.
- D. Increase the size of your parquet files to ensure them to be 1 GB minimum.
Correct answer: A
Question # 147
All Google Cloud Bigtable client requests go through a front-end server ______ they are sent to a Cloud Bigtable node.
- A. after
- B. before
- C. once
- D. only if
Correct answer: B
Explanation:
In a Cloud Bigtable architecture all client requests go through a front-end server before they are sent to a Cloud Bigtable node.
The nodes are organized into a Cloud Bigtable cluster, which belongs to a Cloud Bigtable instance, which is a container for the cluster. Each node in the cluster handles a subset of the requests to the cluster.
When additional nodes are added to a cluster, you can increase the number of simultaneous requests that the cluster can handle, as well as the maximum throughput for the entire cluster.
Question # 148
You are building a model to predict whether or not it will rain on a given day. You have thousands of input features and want to see if you can improve training speed by removing some features while having a minimum effect on model accuracy. What can you do?
- A. Eliminate features that are highly correlated to the output labels.
- B. Instead of feeding in each feature individually, average their values in batches of 3.
- C. Combine highly co-dependent features into one representative feature.
- D. Remove the features that have null values for more than 50% of the training records.
Correct answer: C
Question # 149
Which of the following is NOT one of the three main types of triggers that Dataflow supports?
- A. Trigger based on time
- B. Trigger that is a combination of other triggers
- C. Trigger based on element count
- D. Trigger based on element size in bytes
Correct answer: D
Explanation:
There are three major kinds of triggers that Dataflow supports:
1. Time-based triggers.
2. Data-driven triggers. You can set a trigger to emit results from a window when that window has received a certain number of data elements.
3. Composite triggers. These triggers combine multiple time-based or data-driven triggers in some logical way.
Reference: https://cloud.google.com/dataflow/model/triggers
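To make the data-driven case concrete, the sketch below simulates an element-count trigger in plain Python. It does not use the Dataflow or Apache Beam SDK. The function name `count_trigger` and the sample elements are hypothetical, and emitting the leftover elements at the end stands in for the window closing.

```python
from typing import Iterable, Iterator, List


def count_trigger(elements: Iterable[str], every: int) -> Iterator[List[str]]:
    """Emit a pane each time `every` elements have arrived (data-driven trigger)."""
    pane: List[str] = []
    for element in elements:
        pane.append(element)
        if len(pane) == every:
            yield pane
            pane = []
    if pane:
        # Remaining elements are emitted when the window closes.
        yield pane


panes = list(count_trigger(["a", "b", "c", "d", "e"], every=2))
print(panes)  # prints [['a', 'b'], ['c', 'd'], ['e']]
```

Note that the trigger counts elements, not bytes, which is why option D is the one that is not supported.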
Question # 150
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network, allowing them to account for the impact of dynamic regional politics on location availability and cost. Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data.
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day.
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day's events. They also want to use streaming ingestion. What should you do?
- A. Create a table called tracking_table and include a DATE column.
- B. Create a table called tracking_table with a TIMESTAMP column to represent the day.
- C. Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.
- D. Create a partitioned table called tracking_table and include a TIMESTAMP column.
Correct answer: D
Question # 151
An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?
- A. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
- B. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
- C. Enable BigQuery monitoring in Google Stackdriver and create an alert.
- D. Use federated data sources, and check data in the SQL query.
Correct answer: A
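The dead-letter idea in option A can be sketched without any cloud services: rows that parse go one way, rows that fail are kept with the reason for later analysis. The three-column schema, the function name `split_rows`, and the sample data are assumptions of this illustration. In a real pipeline the two outputs would be written to the main and dead-letter BigQuery tables.

```python
import csv
import io
from typing import List, Tuple

EXPECTED_COLUMNS = 3  # hypothetical schema: id, name, amount


def split_rows(csv_text: str) -> Tuple[List[dict], List[dict]]:
    """Return (valid rows, dead-letter rows) for a CSV payload."""
    valid, dead_letter = [], []
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        try:
            if len(row) != EXPECTED_COLUMNS:
                raise ValueError(f"expected {EXPECTED_COLUMNS} columns, got {len(row)}")
            valid.append({"id": int(row[0]), "name": row[1], "amount": float(row[2])})
        except ValueError as err:
            # Keep the raw row and the reason so it can be analyzed later.
            dead_letter.append({"line": line_no, "raw": row, "error": str(err)})
    return valid, dead_letter


good, bad = split_rows("1,alice,9.50\n2,bob\nx,carol,3.00\n")
print(len(good), len(bad))  # prints 1 2
```

Unlike a load job that aborts on bad records, this keeps the good rows flowing while preserving every rejected row.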
Question # 152
You need to compose visualization for operations teams with the following requirements:
* Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every minute)
* The report must not be more than 3 hours delayed from live data.
* The actionable report should only show suboptimal links.
* Most suboptimal links should be sorted to the top.
* Suboptimal links can be grouped and filtered by regional geography.
* User response time to load the report must be <5 seconds.
You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest data without any changes to your visualizations. You want to avoid creating and updating new visualizations each month. What should you do?
- A. Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.
- B. Export the data to a spreadsheet, compose a series of charts and tables, one for each possible combination of criteria, and spread them across multiple tabs.
- C. Look through the current data and compose a series of charts and tables, one for each possible combination of criteria.
- D. Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.
Correct answer: D
Question # 153
You have a BigQuery table that ingests data directly from a Pub/Sub subscription. The ingested data is encrypted with a Google-managed encryption key. You need to meet a new organization policy that requires you to use keys from a centralized Cloud Key Management Service (Cloud KMS) project to encrypt data at rest. What should you do?
- A. Create a new Pub/Sub topic with CMEK and use the existing BigQuery table by using Google-managed encryption key.
- B. Create a new BigQuery table by using customer-managed encryption keys (CMEK), and migrate the data from the old BigQuery table.
- C. Use Cloud KMS encryption key with Dataflow to ingest the existing Pub/Sub subscription to the existing BigQuery table.
- D. Create a new BigQuery table and Pub/Sub topic by using customer-managed encryption keys (CMEK), and migrate the data from the old BigQuery table.
Correct answer: B
Explanation:
To use CMEK for BigQuery, you need to create a key ring and a key in Cloud KMS, and then specify the key resource name when creating or updating a BigQuery table. You cannot change the encryption type of an existing table, so you need to create a new table with CMEK and copy the data from the old table with Google-managed encryption key.
References:
* Customer-managed Cloud KMS keys | BigQuery | Google Cloud
* Creating and managing encryption keys | Cloud KMS Documentation | Google Cloud
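The key resource name mentioned in the explanation follows a fixed path format in Cloud KMS. A minimal sketch of assembling it, where the helper name `kms_key_name` and the project, location, key ring, and key values are hypothetical:

```python
def kms_key_name(project: str, location: str, key_ring: str, key: str) -> str:
    """Build the Cloud KMS key resource name referenced by a CMEK-protected table."""
    return f"projects/{project}/locations/{location}/keyRings/{key_ring}/cryptoKeys/{key}"


print(kms_key_name("central-kms-project", "us", "bq-ring", "bq-key"))
# prints projects/central-kms-project/locations/us/keyRings/bq-ring/cryptoKeys/bq-key
```

Because the project is part of the name, the new table can reference a key that lives in the centralized Cloud KMS project rather than in the project that holds the data.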
Question # 154
You are responsible for writing your company's ETL pipelines to run on an Apache Hadoop cluster. The pipeline will require some checkpointing and splitting pipelines. Which method should you use to write the pipelines?
- A. PigLatin using Pig
- B. HiveQL using Hive
- C. Java using MapReduce
- D. Python using MapReduce
Correct answer: A
Explanation:
Pig is a scripting language that can be used for checkpointing and splitting pipelines.
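To pass the Google Professional-Data-Engineer exam, you need experience in data engineering, data analysis, and data warehousing. You also need experience designing and implementing solutions with Google Cloud Platform data processing technologies such as Cloud Dataflow, BigQuery, and Cloud Dataproc. In addition, you need strong knowledge of SQL, Python, and Java, as well as experience with data modeling and data visualization.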
Recently updated questions from the latest 2024 GoShiken Professional-Data-Engineer PDF: https://www.goshiken.com/Google/Professional-Data-Engineer-mondaishu.html
The Professional-Data-Engineer exam comes with a guarantee. 333 questions have been updated: https://drive.google.com/open?id=17yioeve7ei5SQd5PFOxpfqv--BgNl2gY