
Free Professional-Data-Engineer exam braindump: certification guide questions and answers
Professional-Data-Engineer certification overview: latest Professional-Data-Engineer PDF question set
Question # 94
Your software uses a simple JSON format for all messages. These messages are published to Google Cloud Pub/Sub, then processed with Google Cloud Dataflow to create a real-time dashboard for the CFO.
During testing, you notice that some messages are missing in the dashboard. You check the logs, and all messages are being published to Cloud Pub/Sub successfully. What should you do next?
- A. Use Google Stackdriver Monitoring on Cloud Pub/Sub to find the missing messages.
- B. Run a fixed dataset through the Cloud Dataflow pipeline and analyze the output.
- C. Switch Cloud Dataflow to pull messages from Cloud Pub/Sub instead of Cloud Pub/Sub pushing messages to Cloud Dataflow.
- D. Check the dashboard application to see if it is not displaying correctly.
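Correct answer: B
Explanation:
The logs show that every message reaches Cloud Pub/Sub, so the loss most likely happens inside the Cloud Dataflow pipeline. Running a known, fixed dataset through the pipeline and comparing the output with the expected result isolates where messages are dropped. Stackdriver Monitoring reports aggregate metrics, such as the number of unacknowledged messages or a publisher outpacing the subscriber, but it cannot identify individual missing messages.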
Question # 95
You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?
- A. Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
- B. Add a SideInput that returns a Boolean if the element is corrupt.
- C. Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.
- D. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
Correct answer: D
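The per-element filtering that a ParDo performs can be sketched in plain Python. This is a minimal illustration, not Beam API code: `drop_corrupt` is a hypothetical function that mimics a DoFn's per-element logic by yielding zero or one outputs per input, and the corruption rule (invalid JSON, or a missing `device_id` or non-numeric `temperature` field) is an assumption made for the example.

```python
import json
from typing import Iterator


def drop_corrupt(raw: str) -> Iterator[dict]:
    """Yield the parsed record if it is valid; yield nothing if it is corrupt.

    Emitting zero outputs is how a per-element function discards an element.
    The validity rule below is hypothetical.
    """
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return  # not valid JSON: discard
    if not isinstance(record, dict):
        return  # valid JSON but not an object: discard
    if "device_id" not in record:
        return
    if not isinstance(record.get("temperature"), (int, float)):
        return
    yield record


def run(messages):
    """Apply drop_corrupt to every element independently."""
    return [out for msg in messages for out in drop_corrupt(msg)]


messages = [
    '{"device_id": "a1", "temperature": 21.5}',
    '{"device_id": "a2", "temperature": "n/a"}',
    'not json at all',
    '{"device_id": "a3", "temperature": 19}',
]
clean = run(messages)
```

In an Apache Beam pipeline, a generator like this could be passed to `beam.FlatMap`, or its body placed in the `process` method of a `DoFn` applied with `beam.ParDo`.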
Question # 96
Data Analysts in your company have the Cloud IAM Owner role assigned to them in their projects to allow them to work with multiple GCP products in their projects. Your organization requires that all BigQuery data access logs be retained for 6 months. You need to ensure that only audit personnel in your company can access the data access logs for all projects. What should you do?
- A. Export the data access logs via a project-level export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project with the exported logs.
- B. Export the data access logs via a project-level export sink to a Cloud Storage bucket in the Data Analysts' projects. Restrict access to the Cloud Storage bucket.
- C. Enable data access logs in each Data Analyst's project. Restrict access to Stackdriver Logging via Cloud IAM roles.
- D. Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs.
Correct answer: D
Explanation:
https://cloud.google.com/iam/docs/roles-audit-logging#scenario_external_auditors
Question # 97
You need ads data to serve AI models and historical data for analytics. Longtail and outlier data points need to be identified. You want to cleanse the data in near-real time before running it through AI models. What should you do?
- A. Use Cloud Composer to identify longtail and outlier data points, and then output a usable dataset to BigQuery.
- B. Use Cloud Storage as a data warehouse, shell scripts for processing, and BigQuery to create views for desired datasets.
- C. Use Dataflow to identify longtail and outlier data points programmatically, with BigQuery as a sink.
- D. Use BigQuery to ingest, prepare, and then analyze the data, and then run queries to create views.
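Correct answer: C
Explanation:
Dataflow can process the data as a stream in near-real time and apply the cleansing logic programmatically before writing the result to BigQuery. The other options describe orchestration or batch-style processing and do not meet the near-real-time requirement.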
Question # 98
Which of these numbers are adjusted by a neural network as it learns from a training dataset (select 2 answers)?
- A. Continuous features
- B. Input values
- C. Weights
- D. Biases
Correct answer: C, D
Explanation:
A neural network is a simple mechanism that's implemented with basic math. The only difference between the traditional programming model and a neural network is that you let the computer determine the parameters (weights and bias) by learning from training datasets.
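To make the distinction concrete, here is a minimal sketch of a single linear neuron trained by gradient descent. The model form, learning rate, and data are assumptions chosen for illustration. Only the weight `w` and bias `b` change during training; the input values in `xs` are read but never modified.

```python
def train(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # the only values the training loop adjusts
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]   # input values: fixed, never updated
ys = [1.0, 3.0, 5.0, 7.0]   # targets generated from y = 2x + 1
w, b = train(xs, ys)
```

After training, `w` and `b` approach 2 and 1, the parameters that generated the targets.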
Question # 99
Which software libraries are supported by Cloud Machine Learning Engine?
- A. Theano and TensorFlow
- B. TensorFlow and Torch
- C. Theano and Torch
- D. TensorFlow
Correct answer: D
Explanation:
Cloud ML Engine mainly does two things:
Enables you to train machine learning models at scale by running TensorFlow training applications in the cloud.
Hosts those trained models for you in the cloud so that you can use them to get predictions about new data.
Reference: https://cloud.google.com/ml-engine/docs/technical-overview#what_it_does
Question # 100
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
- No interaction by the user on the site for 1 hour
- Has added more than $30 worth of products to the basket
- Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?
- A. Use a session window with a gap time duration of 60 minutes.
- B. Use a fixed-time window with a duration of 60 minutes.
- C. Use a sliding time window with a duration of 60 minutes.
- D. Use a global window with a time based trigger with a delay of 60 minutes.
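Correct answer: A
Explanation:
The pipeline sends a message for each user after that user has been inactive for 60 minutes. A session window fits this case: it groups each user's activity into a session that closes once the gap duration passes with no new events.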
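The grouping behind a session window can be sketched in plain Python. This is an illustration of the idea rather than Beam code: `sessionize` is a hypothetical helper, events are assumed to be `(user_id, timestamp_in_seconds)` pairs, and an event joins the current session only if it arrives less than the gap duration after the previous event for the same user.

```python
from collections import defaultdict

GAP_SECONDS = 60 * 60  # 60-minute inactivity gap


def sessionize(events, gap=GAP_SECONDS):
    """Group (user_id, timestamp) events into per-user sessions.

    Returns a dict mapping each user to a list of sessions, where each
    session is a sorted list of timestamps.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] < gap:
                user_sessions[-1].append(ts)   # still within the gap
            else:
                user_sessions.append([ts])     # gap exceeded: new session
        sessions[user] = user_sessions
    return sessions


events = [("alice", 0), ("alice", 1200), ("alice", 9000), ("bob", 300)]
sessions = sessionize(events)
```

Here the first two events for `alice` fall into one session, while the third starts a new one because more than an hour has passed. In the Beam Python SDK, the built-in equivalent is the `Sessions` window function with a gap size.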
Question # 101
You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence. To accomplish the design of separating storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediary data. However, you noticed that one Hadoop job runs very slowly with Cloud Dataproc, when compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue. What should you do?
- A. Allocate more CPU cores of the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up
- B. Allocate additional network interface card (NIC), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage
- C. Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory
- D. Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS
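Correct answer: D
Explanation:
The job is disk I/O intensive, so reading and writing its intermediate data through the Cloud Storage connector is the bottleneck. Storing that job's intermediate data on native HDFS, backed by sufficient persistent disk space, keeps the heavy I/O on local disks, while input and final output data can remain in Cloud Storage.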
Question # 102
You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups?
- A. Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage.
- B. Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery.
- C. Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.
- D. Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage.
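Correct answer: D
Explanation:
Snapshot decorators can only restore table data from within the past seven days, so they cannot recover from errors that are detected after two weeks. Exporting the monthly tables, compressing them, and storing them in Cloud Storage provides a recoverable backup at a lower storage cost than duplicating the data in BigQuery.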
Question # 103
Which of the following statements about the Wide & Deep Learning model are true? (Select 2 answers.)
- A. A good use for the wide and deep model is a recommender system.
- B. A good use for the wide and deep model is a small-scale linear regression problem.
- C. The wide model is used for generalization, while the deep model is used for memorization.
- D. The wide model is used for memorization, while the deep model is used for generalization.
Correct answer: A, D
Explanation:
Can we teach computers to learn like humans do, by combining the power of memorization and generalization? It's not an easy question to answer, but by jointly training a wide linear model (for memorization) alongside a deep neural network (for generalization), one can combine the strengths of both to bring us one step closer. At Google, we call it Wide & Deep Learning. It's useful for generic large-scale regression and classification problems with sparse inputs (categorical features with a large number of possible feature values), such as recommender systems, search, and ranking problems.
Reference: https://research.googleblog.com/2016/06/wide-deep-learning-better-together-with.html
Question # 104
You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file is processed once per day as inexpensively as possible. What should you do?
- A. Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.
- B. Change the processing job to use Google Cloud Dataproc instead.
- C. Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.
- D. Manually start the Cloud Dataflow job each morning when you get into the office.
Correct answer: C
Question # 105
Your company maintains a hybrid deployment with GCP, where analytics are performed on your anonymized customer data. The data are imported to Cloud Storage from your data center through parallel uploads to a data transfer server running on GCP. Management informs you that the daily transfers take too long and have asked you to fix the problem. You want to maximize transfer speeds. Which action should you take?
- A. Increase your network bandwidth from Compute Engine to Cloud Storage.
- B. Increase the CPU size on your server.
- C. Increase the size of the Google Persistent Disk on your server.
- D. Increase your network bandwidth from your datacenter to GCP.
Correct answer: D
Question # 106
You are training a spam classifier. You notice that you are overfitting the training data. Which three actions can you take to resolve this problem? (Choose three.)
- A. Use a larger set of features
- B. Use a smaller set of features
- C. Get more training examples
- D. Increase the regularization parameters
- E. Reduce the number of training examples
- F. Decrease the regularization parameters
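Correct answer: B, C, D
Explanation:
Overfitting means the model is too complex for the amount of training data and fits noise. Using fewer features, adding training examples, and increasing the regularization parameters all constrain the model. Adding features, removing examples, or weakening regularization makes overfitting worse.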
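The regularization parameter named in options D and F controls how strongly large weights are penalized. As a minimal sketch, assume a one-feature linear model without an intercept and an L2 penalty (both are assumptions; the question does not name a model). The weight that minimizes `sum((w*x - y)^2) + lam * w^2` has a closed form, which shows the weight shrinking as `lam` grows.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form minimizer of sum((w*x - y)**2) + lam * w**2."""
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + lam
    return numerator / denominator


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated from y = 2x

# Fit the same data with increasing regularization strength.
weights = {lam: ridge_weight(xs, ys, lam) for lam in (0.0, 1.0, 10.0)}
```

With no penalty the fitted weight equals 2, and each increase in `lam` pulls it closer to zero, which limits how closely the model can follow the training data.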
Question # 107
The Dataflow SDKs have been recently transitioned into which Apache service?
- A. Apache Spark
- B. Apache Kafka
- C. Apache Hadoop
- D. Apache Beam
Correct answer: D
Explanation:
Dataflow SDKs are being transitioned to Apache Beam, as per the latest Google directive.
https://cloud.google.com/dataflow/docs/
Question # 108
How can you get a neural network to learn about relationships between categories in a categorical feature?
- A. Create a hash bucket
- B. Create an embedding column
- C. Create a one-hot column
- D. Create a multi-hot column
Correct answer: B
Explanation:
There are two problems with one-hot encoding. First, it has high dimensionality, meaning that instead of having just one value, like a continuous feature, it has many values, or dimensions. This makes computation more time-consuming, especially if a feature has a very large number of categories. The second problem is that it doesn't encode any relationships between the categories. They are completely independent from each other, so the network has no way of knowing which ones are similar to each other.
Both of these problems can be solved by representing a categorical feature with an embedding column. The idea is that each category has a smaller vector with, let's say, 5 values in it. But unlike a one-hot vector, the values are not usually 0. The values are weights, similar to the weights that are used for basic features in a neural network. The difference is that each category has a set of weights (5 of them in this case).
You can think of each value in the embedding vector as a feature of the category. So, if two categories are very similar to each other, then their embedding vectors should be very similar too.
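The contrast between one-hot vectors and embedding vectors can be shown with a small similarity check. This is a sketch under stated assumptions: the category names and the embedding values are made up for illustration (a real embedding is learned during training), and the vectors use 3 values instead of 5 for brevity.

```python
import math

CATEGORIES = ["cat", "dog", "car"]

# Hypothetical embedding vectors; in practice these are learned weights.
EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.0, 0.9],
}


def one_hot(category):
    """Return a vector with a single 1.0 at the category's index."""
    return [1.0 if c == category else 0.0 for c in CATEGORIES]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


one_hot_sim = cosine(one_hot("cat"), one_hot("dog"))
embed_sim_close = cosine(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
embed_sim_far = cosine(EMBEDDINGS["cat"], EMBEDDINGS["car"])
```

Any two distinct one-hot vectors have zero similarity, so they carry no information about which categories are alike. The embedding vectors, by contrast, place `cat` much closer to `dog` than to `car`.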
Question # 109
......