Excellent Professional-Data-Engineer Updated 2022 Dumps With 100% Exam Passing Guarantee [Q16-Q32]

NEW QUESTION 16
You are designing storage for 20 TB of text files as part of deploying a data pipeline on Google Cloud. Your input data is in CSV format. You want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which storage service and schema design should you use?

A. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.
B. Use Cloud Bigtable for storage. Link as permanent tables in BigQuery for query.
C. Use Cloud Bigtable for storage. Install the HBase shell on a Compute Engine instance to query the Cloud Bigtable data.
D. Use Cloud Storage for storage. Link as temporary tables in BigQuery for query.

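Answer: A

Explanation:
The data must remain in Cloud Storage so that multiple engines can read it, which rules out Cloud Bigtable. A permanent external table in BigQuery can be shared among users through access controls, whereas a temporary table is intended for one-time, ad hoc queries.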

NEW QUESTION 17
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments (development/test, staging, and production) to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
Provide reliable and timely access to data for analysis from distributed research workers.
Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
Ensure secure and efficient transport and storage of telemetry data.
Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100 million records/day.
Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
MJTelco's Google Cloud Dataflow pipeline is now ready to start receiving data from the 50,000 installations. You want to allow Cloud Dataflow to scale its compute power up as required. Which Cloud Dataflow pipeline configuration setting should you update?

A. The number of workers
B. The disk size per worker
C. The zone
D. The maximum number of workers

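Answer: D

Explanation:
Cloud Dataflow autoscaling adds workers up to the configured maximum number of workers, so raising that limit lets the pipeline scale its compute power as load grows. The zone only determines where the workers run.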
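
The maximum-workers setting from option D can be sketched as pipeline flags. This is a minimal illustration, assuming the Apache Beam Python SDK flag names `--autoscaling_algorithm` and `--max_num_workers`; the helper name and the limit of 100 workers are hypothetical and not taken from the case study.

```python
from typing import List


def dataflow_autoscaling_flags(max_workers: int) -> List[str]:
    """Return command-line flags that let Dataflow autoscale up to max_workers.

    The helper itself is hypothetical; the flag names are assumed to match
    the Apache Beam Python SDK worker options.
    """
    return [
        "--runner=DataflowRunner",
        # Throughput-based autoscaling lets the service add or remove workers.
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        # Upper bound on the number of workers the service may start.
        f"--max_num_workers={max_workers}",
    ]


# Illustrative value only; choose a limit that fits your quota and budget.
flags = dataflow_autoscaling_flags(100)
```

In a real pipeline, a list like this would typically be passed to the SDK's pipeline options when the pipeline is constructed.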

NEW QUESTION 18
Which is the preferred method to use to avoid hotspotting in time series data in Bigtable?

A. Hashing
B. Randomization
C. Salting
D. Field promotion

Answer: D

Explanation:
By default, prefer field promotion. Field promotion avoids hotspotting in almost all cases, and it tends to make it easier to design a row key that facilitates queries.
Reference:
https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotti
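
As a rough sketch of field promotion, the function below moves identifying fields into the front of the row key so that writes are spread across devices instead of piling up on the latest timestamp. The field names, the `#` separator, and the sample readings are assumptions made for illustration only.

```python
def promoted_row_key(device_id: str, metric: str, epoch_seconds: int) -> str:
    """Build a row key that leads with identifying fields, not the timestamp."""
    # Zero-pad the timestamp so lexicographic order matches chronological order.
    return f"{device_id}#{metric}#{epoch_seconds:010d}"


# Hypothetical readings: (device, metric, timestamp in seconds).
readings = [
    ("dev-b", "temp", 1650000001),
    ("dev-a", "temp", 1650000000),
    ("dev-a", "temp", 1650000002),
]

# Sorting mimics how Bigtable stores rows in lexicographic key order.
keys = sorted(promoted_row_key(*r) for r in readings)
```

With keys built this way, rows for one device are contiguous and ordered by time, so a prefix scan on `dev-a#temp#` returns that device's series.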

NEW QUESTION 19
You are using Google BigQuery as your data warehouse. Your users report that the following simple query is running very slowly, no matter when they run the query:
SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country
You check the query plan for the query and see the following output in the Read section of Stage:1:

What is the most likely cause of the delay for this query?

A. Users are running too many concurrent queries in the system
B. The [myproject:mydataset.mytable] table has too many partitions
C. Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew
D. Either the state or the city columns in the [myproject:mydataset.mytable] table have too many NULL values

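Answer: C

Explanation:
The query is slow no matter when it runs, which argues against concurrent load. Grouping by a column in which most rows share one value concentrates the work unevenly, which is the typical symptom of data skew.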

NEW QUESTION 20
You have a data pipeline that writes data to Cloud Bigtable using well-designed row keys. You want to monitor your pipeline to determine when to increase the size of your Cloud Bigtable cluster. Which two actions can you take to accomplish this? (Choose two.)

A. Monitor latency of read operations. Increase the size of the Cloud Bigtable cluster if read operations take longer than 100 ms.
B. Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Read pressure index is above 100.
C. Monitor the latency of write operations. Increase the size of the Cloud Bigtable cluster when there is a sustained increase in write latency.
D. Monitor storage utilization. Increase the size of the Cloud Bigtable cluster when utilization increases above 70% of max capacity.
E. Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Write pressure index is above 100.

Answer: B,C

NEW QUESTION 21
You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? (Choose two.)

A. Develop a data pipeline where status updates are appended to BigQuery instead of updated.
B. Use BigQuery UPDATE to further reduce the size of the dataset.
C. Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query.
D. Denormalize the data as much as possible.
E. Preserve the structure of the data as much as possible.

Answer: A,C

NEW QUESTION 22
You work for a global shipping company. You want to train a model on 40 TB of data to predict which ships in each geographic region are likely to cause delivery delays on any given day. The model will be based on multiple attributes collected from multiple sources. Telemetry data, including location in GeoJSON format, will be pulled from each ship and loaded every hour. You want to have a dashboard that shows how many and which ships are likely to cause delays within a region. You want to use a storage solution that has native functionality for prediction and geospatial processing. Which storage solution should you use?

A. Cloud Datastore
B. Cloud Bigtable
C. Cloud SQL for PostgreSQL
D. BigQuery

Answer: D

NEW QUESTION 23
Which of the following are feature engineering techniques? (Select 2 answers)

A. Hidden feature layers
B. Feature prioritization
C. Crossed feature columns
D. Bucketization of a continuous feature

Answer: C,D

Explanation:
Selecting and crafting the right set of feature columns is key to learning an effective model. Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive bins/buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into. Using each base feature column separately may not be enough to explain the data. To learn the differences between different feature combinations, we can add crossed feature columns to the model.
Reference:
https://www.tensorflow.org/tutorials/wide#selecting_and_engineering_features_for_the_model
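
The two techniques in the explanation can be sketched in plain Python. This is a minimal illustration rather than the TensorFlow feature-column API; the age boundaries, the occupation value, and the `_x_` separator are assumptions.

```python
from bisect import bisect_right

# Hypothetical bucket boundaries for a continuous "age" feature.
AGE_BOUNDARIES = [18, 25, 35, 50, 65]


def bucketize(value: float, boundaries: list) -> int:
    """Map a continuous value to the ID of the bucket it falls into."""
    # Bucket 0 is everything below the first boundary; the last bucket
    # is everything at or above the final boundary.
    return bisect_right(boundaries, value)


def cross(*parts) -> str:
    """Combine several categorical values into one crossed feature value."""
    return "_x_".join(str(p) for p in parts)


age_bucket = bucketize(30, AGE_BOUNDARIES)
crossed = cross(age_bucket, "teacher")
```

The crossed value is a single categorical feature, so a model can learn a separate weight for each combination of age bucket and occupation.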

NEW QUESTION 24
You create an important report for your large team in Google Data Studio 360. The report uses Google BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour old. What should you do?

A. Clear your browser history for the past hour then reload the tab showing the visualizations.
B. Disable caching by editing the report settings.
C. Disable caching in BigQuery by editing table details.
D. Refresh your browser tab showing the visualizations.

Answer: B

Explanation:
Reference: https://support.google.com/datastudio/answer/7020039?hl=en

NEW QUESTION 25
You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers. What should you do?

A. Store the warm data as files in Cloud Storage, and store the active data in BigQuery. Keep this ratio as 80% warm and 20% active.
B. Store and process the entire dataset in BigQuery.
C. Store and process the entire dataset in Cloud Bigtable.
D. Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket.

Answer: A

NEW QUESTION 26
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
No interaction by the user on the site for 1 hour
Has added more than $30 worth of products to the basket
Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?

A. Use a global window with a time-based trigger with a delay of 60 minutes.
B. Use a session window with a gap time duration of 60 minutes.
C. Use a fixed-time window with a duration of 60 minutes.
D. Use a sliding time window with a duration of 60 minutes.

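Answer: B

Explanation:
A session window closes after a gap with no activity from a given user, which matches the rule of no interaction on the site for 1 hour.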
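
The session-window behavior described in option B can be illustrated without a Dataflow dependency. The sketch below groups one user's event timestamps into sessions separated by a gap of at least 60 minutes; the function name and sample events are hypothetical, and a real pipeline would rely on the windowing provided by the Beam SDK instead.

```python
from datetime import datetime, timedelta

GAP = timedelta(minutes=60)


def sessionize(timestamps, gap=GAP):
    """Group one user's event times into sessions split by inactivity gaps."""
    sessions = []
    for ts in sorted(timestamps):
        # An event joins the current session if it arrives before the gap
        # since the previous event has fully elapsed.
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Hypothetical click times for a single user.
events = [
    datetime(2022, 1, 1, 10, 0),
    datetime(2022, 1, 1, 10, 20),
    datetime(2022, 1, 1, 10, 50),
    datetime(2022, 1, 1, 12, 30),
]
sessions = sessionize(events)
```

Each closed session marks a point where the user was inactive for the full gap, which is where the basket value and transaction checks from the rules would be applied.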

NEW QUESTION 27
Which of these rules apply when you add preemptible workers to a Dataproc cluster (select 2 answers)?

A. A Dataproc cluster cannot have only preemptible workers.
B. Preemptible workers cannot store data.
C. Preemptible workers cannot use persistent disk.
D. If a preemptible worker is reclaimed, then a replacement worker must be added manually.

Answer: A,B

Explanation:
The following rules will apply when you use preemptible workers with a Cloud Dataproc cluster:
Processing only: Since preemptibles can be reclaimed at any time, preemptible workers do not store data. Preemptibles added to a Cloud Dataproc cluster only function as processing nodes.
No preemptible-only clusters: To ensure clusters do not lose all workers, Cloud Dataproc cannot create preemptible-only clusters.
Persistent disk size: As a default, all preemptible workers are created with the smaller of 100GB or the primary worker boot disk size. This disk space is used for local caching of data and is not available through HDFS.
The managed group automatically re-adds workers lost due to reclamation as capacity permits.
Reference: https://cloud.google.com/dataproc/docs/concepts/preemptible-vms

NEW QUESTION 28
You have enabled the free integration between Firebase Analytics and Google BigQuery. Firebase now automatically creates a new table daily in BigQuery in the format app_events_YYYYMMDD. You want to query all of the tables for the past 30 days in legacy SQL. What should you do?

A. Use WHERE date BETWEEN YYYY-MM-DD AND YYYY-MM-DD
B. Use SELECT IF(date >= YYYY-MM-DD AND date <= YYYY-MM-DD)
C. Use the TABLE_DATE_RANGE function
D. Use the WHERE_PARTITIONTIME pseudo column

Answer: C
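
A query using TABLE_DATE_RANGE might look like the string built below. This is a sketch under assumptions: the dataset name `my_dataset` and the COUNT(*) projection are hypothetical, and the helper only assembles legacy SQL text without submitting it to BigQuery.

```python
def last_n_days_query(dataset: str, prefix: str, days: int) -> str:
    """Build a legacy SQL query over daily tables from the last `days` days."""
    return (
        "SELECT COUNT(*) AS events\n"
        # TABLE_DATE_RANGE expands to every table named <prefix>YYYYMMDD
        # whose date suffix falls between the two timestamps.
        f"FROM TABLE_DATE_RANGE([{dataset}.{prefix}],\n"
        f"                      DATE_ADD(CURRENT_TIMESTAMP(), -{days}, 'DAY'),\n"
        "                      CURRENT_TIMESTAMP())"
    )


# "my_dataset" is a placeholder; "app_events_" is the prefix from the question.
query = last_n_days_query("my_dataset", "app_events_", 30)
```

Note that the table prefix passed to the function ends with the underscore and omits the date suffix.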

NEW QUESTION 29
You want to use a database of information about tissue samples to classify future tissue samples as either normal or mutated. You are evaluating an unsupervised anomaly detection method for classifying the tissue samples. Which two characteristics support this method? (Choose two.)

A. You expect future mutations to have similar features to the mutated samples in the database.
B. You expect future mutations to have different features from the mutated samples in the database.
C. You already have labels for which samples are mutated and which are normal in the database.
D. There are roughly equal occurrences of both normal and mutated samples in the database.
E. There are very few occurrences of mutations relative to normal samples.

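Answer: B,E

Explanation:
Unsupervised anomaly detection needs no labels and assumes that anomalies are rare and differ from the bulk of the data, so it suits mutations that are infrequent and may not resemble previously seen ones.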

NEW QUESTION 30
You're training a model to predict housing prices based on an available dataset with real estate properties. Your plan is to train a fully connected neural net, and you've discovered that the dataset contains latitude and longitude of the property. Real estate professionals have told you that the location of the property is highly influential on price, so you'd like to engineer a feature that incorporates this physical dependency. What should you do?

A. Create a feature cross of latitude and longitude, bucketize it at the minute level and use L1 regularization during optimization.
B. Create a feature cross of latitude and longitude, bucketize it at the minute level and use L2 regularization during optimization.
C. Create a numeric column from a feature cross of latitude and longitude.
D. Provide latitude and longitude as input vectors to your neural net.

Answer: C

Explanation:
Reference: https://cloud.google.com/bigquery/docs/gis-data

NEW QUESTION 31
You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

A. Add capacity (memory and disk space) to the database server by the order of 200.
B. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
C. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.
D. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

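Answer: B

Explanation:
Splitting the single table into separate patient and visit tables removes the self-joins that make the reports expensive as the data grows.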
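
The normalized layout described in option B can be sketched with an in-memory SQLite database. The table names, columns, and sample rows are hypothetical and serve only to show a report built from a plain join between two tables instead of a self-join on one wide table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per patient.
    CREATE TABLE patients (
        patient_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    -- One row per visit, pointing back to its patient.
    CREATE TABLE visits (
        visit_id   INTEGER PRIMARY KEY,
        patient_id INTEGER NOT NULL REFERENCES patients(patient_id),
        clinic     TEXT NOT NULL,
        visit_date TEXT NOT NULL
    );
    INSERT INTO patients VALUES (1, 'Patient A'), (2, 'Patient B');
    INSERT INTO visits VALUES
        (1, 1, 'North', '2022-01-05'),
        (2, 1, 'North', '2022-02-10'),
        (3, 2, 'South', '2022-01-20');
    """
)

# Visits per patient, computed with a join between the two tables.
report = conn.execute(
    """
    SELECT p.name, COUNT(v.visit_id)
    FROM patients AS p
    JOIN visits AS v ON v.patient_id = p.patient_id
    GROUP BY p.patient_id
    ORDER BY p.name
    """
).fetchall()
conn.close()
```

Patient attributes are stored once, so each visit row stays narrow and the join can use the `patient_id` key.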

NEW QUESTION 32
......
