[2024 Latest] Try the free site's materials for the best Professional-Data-Engineer exam question set [Q49-Q66]

Download the free Google Cloud Certified Professional-Data-Engineer official certification guide PDF

Question # 49
Which of these numbers are adjusted by a neural network as it learns from a training dataset (select 2 answers)?

  • A. Biases
  • B. Input values
  • C. Continuous features
  • D. Weights

Correct answer: A, D

Explanation:
A neural network is a simple mechanism that's implemented with basic math. The only difference between the traditional programming model and a neural network is that you let the computer determine the parameters (weights and biases) by learning from training datasets.
Reference:
https://cloud.google.com/blog/big-data/2016/07/understanding-neural-networks-with-tensorflow-playground
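
To make the distinction concrete, here is a minimal sketch of a single linear neuron trained by gradient descent. The function name `train_neuron`, the learning rate, and the sample data are assumptions chosen for illustration; only the weight `w` and bias `b` are updated, while the input values stay as given.

```python
def train_neuron(samples, lr=0.1, epochs=200):
    """Fit y = w * x + b by per-sample gradient descent on squared error."""
    w, b = 0.0, 0.0  # the only numbers the training loop adjusts
    for _ in range(epochs):
        for x, y in samples:
            error = (w * x + b) - y
            w -= lr * error * x  # gradient step for the weight
            b -= lr * error      # gradient step for the bias
    return w, b


# Illustrative data generated from y = 2x + 1 (an assumption for this sketch).
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
w, b = train_neuron(samples)
print(round(w, 3), round(b, 3))
```

The inputs in `samples` are read but never modified, which is why "Input values" and "Continuous features" are not among the learned numbers.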


Question # 50
You are training a spam classifier. You notice that you are overfitting the training data. Which three actions can you take to resolve this problem? (Choose three.)

  • A. Reduce the number of training examples
  • B. Use a larger set of features
  • C. Decrease the regularization parameters
  • D. Get more training examples
  • E. Use a smaller set of features
  • F. Increase the regularization parameters

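Correct answer: D, E, F

Explanation:
Overfitting means the model has fit noise in the training data. Getting more training examples, using a smaller set of features, and increasing the regularization parameters all constrain the model and reduce overfitting. The opposite actions (A, B, and C) tend to make overfitting worse.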
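
As a rough illustration of what increasing a regularization parameter does, the sketch below fits a one-parameter model y ≈ w·x with an L2 penalty. The function `ridge_slope` and the data are assumptions for this example, not part of the exam material.

```python
def ridge_slope(xs, ys, l2):
    """Closed-form slope minimizing sum((y - w*x)**2) + l2 * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + l2)


# Illustrative data (assumed); larger l2 pulls the slope toward zero.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
for l2 in (0.0, 1.0, 10.0):
    print(l2, round(ridge_slope(xs, ys, l2), 3))
```

A stronger penalty shrinks the fitted weight, which limits how closely the model can chase noise in the training set.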


Question # 51
Your company has hired a new data scientist who wants to perform complicated analyses across very large datasets stored in Google Cloud Storage and in a Cassandra cluster on Google Compute Engine. The scientist primarily wants to create labelled data sets for machine learning projects, along with some visualization tasks.
She reports that her laptop is not powerful enough to perform her tasks and it is slowing her down. You want to help her perform her tasks. What should you do?

  • A. Deploy Google Cloud Datalab to a virtual machine (VM) on Google Compute Engine.
  • B. Host a visualization tool on a VM on Google Compute Engine.
  • C. Grant the user access to Google Cloud Shell.
  • D. Run a local version of Jupyter on the laptop.

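Correct answer: A

Explanation:
Cloud Datalab provides a hosted notebook environment that runs on a Compute Engine VM, so the analysis and visualization work uses cloud resources instead of the laptop. Cloud Shell is a small administrative environment and is not intended for heavy data analysis.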


Question # 52
Which of these statements about BigQuery caching is true?

  • A. BigQuery caches query results for 48 hours.
  • B. Query results are cached even if you specify a destination table.
  • C. There is no charge for a query that retrieves its results from cache.
  • D. By default, a query's results are not cached.

Correct answer: C

Explanation:
When query results are retrieved from a cached results table, you are not charged for the query. BigQuery caches query results for 24 hours, not 48 hours. Query results are not cached if you specify a destination table. A query's results are always cached except under certain conditions, such as if you specify a destination table.
Reference: https://cloud.google.com/bigquery/querying-data#query-caching


Question # 53
You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud Dataflow job to process that log file. You need to make sure the log file is processed once per day as inexpensively as possible. What should you do?

  • A. Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.
  • B. Manually start the Cloud Dataflow job each morning when you get into the office.
  • C. Change the processing job to use Google Cloud Dataproc instead.
  • D. Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.

Correct answer: D


Question # 54
You work on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split your dataset into train and test samples (in a 90/10 ratio). After you trained the neural network and evaluated your model on a test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set. How should you improve the performance of your model?

  • A. Increase the share of the test sample in the train-test split.
  • B. Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.
  • C. Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of vocabularies or n-grams used.
  • D. Try to collect more data and increase the size of your dataset.

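Correct answer: C

Explanation:
A training error well above the test error indicates that the model is underfitting rather than overfitting, so adding model capacity is the appropriate fix. Regularization would constrain the model further.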


Question # 55
Your company receives both batch- and stream-based event data. You want to process the data using Google Cloud Dataflow over a predictable time period. However, you realize that in some instances data can arrive late or out of order. How should you design your Cloud Dataflow pipeline to handle data that is late or out of order?

  • A. Set sliding windows to capture all the lagged data.
  • B. Use watermarks and timestamps to capture the lagged data.
  • C. Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.
  • D. Set a single global window to capture all the data.

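Correct answer: B

Explanation:
Dataflow tracks event-time progress with watermarks, and each element carries a timestamp. Together with windowing and triggers, these let the pipeline account for data that arrives late or out of order. Sliding windows alone do not handle late data.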
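
Option B refers to watermarks and timestamps. The sketch below is a plain-Python simulation of the idea, not the Dataflow or Apache Beam API: the function `assign_windows`, the fixed window size, and the `allowed_lateness` parameter are assumptions for illustration, and the watermark is simplified to the largest event time seen so far.

```python
from collections import defaultdict


def assign_windows(events, window_size, allowed_lateness):
    """Group (event_time, value) pairs, given in arrival order, into fixed windows.

    The watermark is approximated as the maximum event time seen so far.
    An element is dropped when the watermark has already passed the end of
    its window by more than allowed_lateness.
    """
    windows = defaultdict(list)
    dropped = []
    watermark = float("-inf")
    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        window_end = window_start + window_size
        if watermark > window_end + allowed_lateness:
            dropped.append(value)  # arrived too late to be counted
        else:
            windows[window_start].append(value)
        watermark = max(watermark, event_time)
    return dict(windows), dropped


# Hypothetical events in arrival order; "c" and "e" arrive out of order.
events = [(1, "a"), (12, "b"), (8, "c"), (30, "d"), (3, "e")]
windows, dropped = assign_windows(events, window_size=10, allowed_lateness=5)
print(windows, dropped)
```

Here "c" is late but still within the allowed lateness, so it lands in its original window, while "e" arrives after the watermark has moved too far ahead and is dropped.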


Question # 56
You are building an ELT solution in BigQuery by using Dataform. You need to perform uniqueness and null value checks on your final tables. What should you do to efficiently integrate these checks into your pipeline?

  • A. Build BigQuery user-defined functions (UDFs).
  • B. Build Dataform assertions into your code.
  • C. Create Dataplex data quality tasks.
  • D. Write a Spark-based stored procedure.

Correct answer: B

Explanation:
Dataform assertions are data quality tests that find rows that violate one or more rules specified in the query. If the query returns any rows, the assertion fails. Dataform runs assertions every time it updates your SQL workflow and alerts you if any assertions fail. You can create assertions for all Dataform table types: tables, incremental tables, views, and materialized views. You can add built-in assertions to the config block of a table, such as nonNull and rowConditions, or create manual assertions with SQLX for advanced use cases.
Dataform automatically creates views in BigQuery that contain the results of compiled assertion queries, which you can inspect to debug failing assertions. Because the checks live alongside the table definitions, assertions are an efficient way to integrate data quality checks into a Dataform-based ELT solution in BigQuery. References:
* Test tables with assertions | Dataform | Google Cloud
* Test data quality with assertions | Dataform
* Data quality tests and documenting datasets | Dataform
* Data quality testing with SQL assertions | Dataform
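
The rule that an assertion fails when its query returns any rows can be sketched outside Dataform. The example below uses Python's built-in `sqlite3` as a stand-in for BigQuery; the table `orders`, its columns, and the helper `failing_rows` are hypothetical, and this is not Dataform's own syntax.

```python
import sqlite3


def failing_rows(conn, assertion_sql):
    """Run an assertion query; an empty result means the assertion passes."""
    return conn.execute(assertion_sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "ann"), (2, "bob"), (2, "bob"), (3, None)],
)

# Null value check: select the rows that violate the rule.
null_check = "SELECT order_id, customer FROM orders WHERE customer IS NULL"
# Uniqueness check: select keys that appear more than once.
unique_check = (
    "SELECT order_id, COUNT(*) AS n FROM orders "
    "GROUP BY order_id HAVING COUNT(*) > 1"
)

null_violations = failing_rows(conn, null_check)
unique_violations = failing_rows(conn, unique_check)
print(null_violations, unique_violations)
```

Both checks return rows for this sample data, so both assertions would be reported as failing.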


Question # 57
Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. Transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data. How should you deduplicate the data most efficiently?

  • A. Compute the hash value of each data entry, and compare it with all historical data.
  • B. Maintain a database table to store the hash value and other metadata for each data entry.
  • C. Assign global unique identifiers (GUID) to each data entry.
  • D. Store each data entry as the primary key in a separate database and apply an index.

Correct answer: D
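
One of the options above describes keeping a table of hash values and metadata for each data entry. A minimal sketch of that approach follows, with an in-memory dictionary standing in for the database table. The field names, the choice of SHA-256, and the decision to hash the payload without the transmission timestamp are all assumptions made for illustration.

```python
import hashlib


def payload_hash(payload):
    """Hash the payload fields in a stable order, ignoring the send time."""
    canonical = "|".join(f"{key}={payload[key]}" for key in sorted(payload))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(transmissions):
    """Keep the first transmission of each distinct payload."""
    seen = {}  # stands in for a table of hash -> metadata
    unique = []
    for sent_at, payload in transmissions:
        digest = payload_hash(payload)
        if digest not in seen:
            seen[digest] = {"first_seen": sent_at}
            unique.append(payload)
    return unique, seen


# Hypothetical transmissions; the second is a re-send of the first.
transmissions = [
    ("06:00", {"sku": "a", "qty": 1}),
    ("06:05", {"qty": 1, "sku": "a"}),
    ("12:00", {"sku": "b", "qty": 2}),
]
unique, seen = deduplicate(transmissions)
print(len(unique), len(seen))
```

Excluding the transmission timestamp from the hash matters here, because a re-transmission carries the same payload with a new timestamp.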


Question # 58
You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?

  • A. Train on the existing data while using the new data as your test set.
  • B. Continuously retrain the model on just the new data.
  • C. Continuously retrain the model on a combination of existing data and the new data.
  • D. Train on the new data while using the existing data as your test set.

Correct answer: C

Explanation:
Retrain on a combination of the existing data and the new data, so the model picks up changing preferences while still covering earlier patterns.


Question # 59
You have a Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update. What should you do?

  • A. Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline.
  • B. Update the current pipeline and provide the transform mapping JSON object.
  • C. Update the current pipeline and use the drain flag.
  • D. Create a new pipeline that has the same Cloud Pub/Sub subscription and cancel the old pipeline.

Correct answer: B

Explanation:
If any transform names in your pipeline have changed, you must supply a transform mapping and pass it using the --transformNameMapping option.
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#preventing_compatibility_breaks
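
The transform mapping mentioned above is a JSON object that maps old transform names to new ones. A small sketch of building that value is shown below; the helper `build_transform_mapping` and the transform names `ParseEvents` and `ParseEventsV2` are hypothetical.

```python
import json


def build_transform_mapping(renames):
    """Serialize an {old_name: new_name} dict as a JSON object string."""
    for old_name, new_name in renames.items():
        if not old_name or not new_name:
            raise ValueError("transform names must be non-empty")
    return json.dumps(renames, sort_keys=True)


# Hypothetical rename of one transform between pipeline versions.
mapping = build_transform_mapping({"ParseEvents": "ParseEventsV2"})
flag = f"--transformNameMapping={mapping}"
print(flag)
```

The resulting string is what would be passed with the `--transformNameMapping` option when launching the replacement job. Shell quoting of the JSON is left out of this sketch.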


Question # 60
You are building a new real-time data warehouse for your company and will use Google BigQuery streaming inserts. There is no guarantee that data will only be sent in once but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?

  • A. Include ORDER BY DESC on timestamp column and LIMIT to 1.
  • B. Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.
  • C. Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
  • D. Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.

Correct answer: D
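
The ROW_NUMBER pattern from option D can be tried locally. The sketch below runs it with Python's built-in `sqlite3` as a stand-in for BigQuery (window functions need SQLite 3.25 or later); the table `events` and its columns are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id TEXT, event_ts INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("a", 1, 10), ("a", 2, 10), ("b", 1, 20), ("c", 5, 30), ("c", 5, 30)],
)

# Number the rows within each unique ID, newest first, and keep row 1 only.
DEDUP_SQL = """
SELECT id, event_ts, value
FROM (
  SELECT id, event_ts, value,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_ts DESC) AS row_num
  FROM events
)
WHERE row_num = 1
ORDER BY id
"""
rows = conn.execute(DEDUP_SQL).fetchall()
print(rows)
```

Each unique ID contributes exactly one row to the result, regardless of how many times it was inserted.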


Question # 61
Which of the following statements about Legacy SQL and Standard SQL is not true?

  • A. One difference between the two query languages is how you specify fully-qualified table names (i.e. table names that include their associated project name).
  • B. Standard SQL is the preferred query language for BigQuery.
  • C. You need to set a query language for each dataset and the default is Standard SQL.
  • D. If you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.

Correct answer: C

Explanation:
You do not set a query language for each dataset. It is set each time you run a query and the default query language is Legacy SQL.
Standard SQL has been the preferred query language since BigQuery 2.0 was released. In legacy SQL, to query a table with a project-qualified name, you use a colon, :, as a separator. In standard SQL, you use a period, ., instead.
Due to the differences in syntax between the two query languages (such as with project-qualified table names), if you write a query in Legacy SQL, it might generate an error if you try to run it with Standard SQL.
Reference:
https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql
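
The separator difference described above can be shown with a small rewrite helper. This is a sketch only: the function `legacy_to_standard` is hypothetical, it handles just bracketed `[project:dataset.table]` references, and it does not cover the other differences between the two dialects.

```python
import re

# Matches a bracketed legacy reference of the form [project:dataset.table].
LEGACY_REF = re.compile(r"\[([^:\]]+):([^.\]]+)\.([^\]]+)\]")


def legacy_to_standard(query):
    """Rewrite legacy table references to backtick-quoted, dot-separated form."""
    return LEGACY_REF.sub(r"`\1.\2.\3`", query)


legacy = "SELECT word FROM [bigquery-public-data:samples.shakespeare]"
print(legacy_to_standard(legacy))
```

A query that still contains the colon form would be rejected by the Standard SQL parser, which is the kind of error option D describes.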


Question # 62
Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow.
Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.
The data scientists have written the following code to read the data for new key features in the logs.
BigQueryIO.Read
.named("ReadLogData")
.from("clouddataflow-readonly:samples.log_data")
You want to improve the performance of this data read. What should you do?

  • A. Use .fromQuery operation to read specific fields from the table.
  • B. Use both the Google BigQuery TableSchema and TableFieldSchema classes.
  • C. Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.
  • D. Specify the TableReference object in the code.

Correct answer: C


Question # 63
You are migrating a large number of files from a public HTTPS endpoint to Cloud Storage. The files are protected from unauthorized access using signed URLs. You created a TSV file that contains the list of object URLs and started a transfer job by using Storage Transfer Service. You notice that the job has run for a long time and eventually failed. Checking the logs of the transfer job reveals that the job was running fine until one point, and then it failed due to HTTP 403 errors on the remaining files. You verified that there were no changes to the source system. You need to fix the problem to resume the migration process. What should you do?

  • A. Create a new TSV file for the remaining files by generating signed URLs with a longer validity period. Split the TSV file into multiple smaller files and submit them as separate Storage Transfer Service jobs in parallel.
  • B. Renew the TLS certificate of the HTTPS endpoint. Remove the completed files from the TSV file and rerun the Storage Transfer Service job.
  • C. Set up Cloud Storage FUSE, and mount the Cloud Storage bucket on a Compute Engine instance. Remove the completed files from the TSV file. Use a shell script to iterate through the TSV file and download the remaining URLs to the FUSE mount point.
  • D. Update the file checksums in the TSV file from using MD5 to SHA256. Remove the completed files from the TSV file and rerun the Storage Transfer Service job.

Correct answer: A

Explanation:
A signed URL is a URL that provides limited permission and time to access a resource on a web server. It is often used to grant temporary access to protected files without requiring authentication. Storage Transfer Service is a service that allows you to transfer data from external sources, such as HTTPS endpoints, to Cloud Storage buckets. You can use a TSV file to specify the list of URLs to transfer. In this scenario, the most likely cause of the HTTP 403 errors is that the signed URLs have expired before the transfer job could complete.
This could happen if the signed URLs have a short validity period or the transfer job takes a long time due to the large number of files or network latency. To fix the problem, you need to create a new TSV file for the remaining files by generating new signed URLs with a longer validity period. This will ensure that the URLs do not expire before the transfer job finishes. You can use the Cloud Storage tools or your own program to generate signed URLs. Additionally, you can split the TSV file into multiple smaller files and submit them as separate Storage Transfer Service jobs in parallel. This will speed up the transfer process and reduce the risk of errors. References:
* Signed URLs | Cloud Storage Documentation
* V4 signing process with Cloud Storage tools
* V4 signing process with your own program
* Using a URL list file
* What Is a 403 Forbidden Error (and How Can I Fix It)?
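
Splitting the URL list into smaller files, as described above, can be sketched as follows. The helper `split_url_list`, the chunk size, and the example URLs are assumptions; each chunk repeats the `TsvHttpData-1.0` header line that a URL list file starts with. Regenerating the signed URLs themselves is not shown.

```python
HEADER = "TsvHttpData-1.0"


def split_url_list(lines, chunk_size):
    """Split URL list lines into chunks, each starting with the header line."""
    urls = [ln for ln in lines if ln.strip() and ln.strip() != HEADER]
    chunks = []
    for start in range(0, len(urls), chunk_size):
        chunks.append([HEADER] + urls[start:start + chunk_size])
    return chunks


# Hypothetical remaining entries (real entries would be freshly signed URLs).
url_list = [
    HEADER,
    "https://example.com/files/a.bin",
    "https://example.com/files/b.bin",
    "https://example.com/files/c.bin",
]
chunks = split_url_list(url_list, chunk_size=2)
print(len(chunks))
```

Each chunk can then be written to its own file and submitted as a separate transfer job, so that every job finishes well within the validity period of its URLs.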


Question # 64
Which of these statements about exporting data from BigQuery is false?

  • A. The only compression option available is GZIP.
  • B. The only supported export destination is Google Cloud Storage.
  • C. Data can only be exported in JSON or Avro format.
  • D. To export more than 1 GB of data, you need to put a wildcard in the destination filename.

Correct answer: C

Explanation:
Data can be exported in CSV, JSON, or Avro format. If you are exporting nested or repeated data, then CSV format is not supported.
Reference: https://cloud.google.com/bigquery/docs/exporting-data


Question # 65
You are planning to use Google's Dataflow SDK to analyze customer data such as that displayed below. Your project requirement is to extract only the customer name from the data source and then write to an output PCollection.
Tom,555 X street
Tim,553 Y street
Sam, 111 Z street
Which operation is best suited for the above data processing requirement?

  • A. Source API
  • B. Sink API
  • C. Data extraction
  • D. ParDo

Correct answer: D

Explanation:
In the Google Cloud Dataflow SDK, you can use a ParDo transform to extract only the customer name from each element in your PCollection.
Reference: https://cloud.google.com/dataflow/model/par-do
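
The per-element processing that ParDo performs can be imitated in plain Python. The sketch below is an analogy rather than the Dataflow SDK API: `par_do` and `extract_name` are hypothetical helpers, and a list stands in for the PCollection.

```python
def extract_name(element):
    """Per-element step, analogous to the processing function of a DoFn."""
    name, _address = element.split(",", 1)
    yield name.strip()


def par_do(collection, fn):
    """Apply fn to every element and flatten whatever it emits."""
    return [output for element in collection for output in fn(element)]


records = ["Tom,555 X street", "Tim,553 Y street", "Sam, 111 Z street"]
names = par_do(records, extract_name)
print(names)
```

Because the function may emit zero, one, or many outputs per element, the same pattern also covers filtering and splitting, not only field extraction.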


Question # 66
......

Google Professional-Data-Engineer official certification guide PDF: https://www.goshiken.com/Google/Professional-Data-Engineer-mondaishu.html

The question set for the Professional-Data-Engineer exam (Google Certified Professional Data Engineer Exam) is available here: https://drive.google.com/open?id=1mrI9cfaGutkyfLK7AKASocfFy7chfrgK