Q1.An administrator needs to design a strategy for the schema in a Redshift cluster. The administrator needs to
determine the optimal distribution style for the tables in the Redshift schema.
In which two circumstances would choosing EVEN distribution be most appropriate? (Choose two.)
- A: When the tables are highly denormalized and do NOT participate in frequent joins.
- B: When data must be grouped based on a specific key on a defined slice.
- C: When data transfer between nodes must be eliminated.
- D: When a new table has been loaded and it is unclear how it will be joined to dimension tables.
solution: A, D
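
To make the contrast between EVEN and KEY distribution concrete, here is a toy sketch. It does not reproduce Redshift's internal hashing; the slice count, the `customer_id` column, and the skewed sample data are all assumptions chosen for illustration. It only shows that round-robin placement stays balanced regardless of the data, while key-based placement co-locates equal keys and can skew.

```python
from collections import Counter
from itertools import cycle
import zlib


def distribute_even(rows, num_slices):
    """Round-robin placement: the slice depends only on arrival order."""
    slices = cycle(range(num_slices))
    return [next(slices) for _ in rows]


def distribute_key(rows, num_slices, key):
    """Hash placement: rows with the same key value share a slice.

    crc32 is a stand-in hash for illustration, not Redshift's algorithm.
    """
    return [zlib.crc32(str(row[key]).encode()) % num_slices for row in rows]


# Hypothetical skewed table: 900 of 1000 rows share customer_id = 1.
rows = [{"customer_id": 1 if i % 10 else i} for i in range(1000)]

even_counts = Counter(distribute_even(rows, 4))
key_counts = Counter(distribute_key(rows, 4, "customer_id"))

print("EVEN:", dict(even_counts))
print("KEY: ", dict(key_counts))
```

With this sample data, every slice receives the same number of rows under EVEN placement, whereas one slice holds at least 900 rows under key-based placement.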
Q2.A large grocery distributor receives daily depletion reports from the field in the form of gzip archives of CSV
files uploaded to Amazon S3. The files range from 500MB to 5GB. These files are processed daily by an EMR job.
Recently it has been observed that the file sizes vary, and the EMR jobs take too long. The distributor needs to
tune and optimize the data processing workflow with this limited information to improve the performance of the
EMR job.
Which recommendation should an administrator provide?
- A: Reduce the HDFS block size to increase the number of task processors.
- B: Use bzip2 or Snappy rather than gzip for the archives.
- C: Decompress the gzip archives and store the data as CSV files.
- D: Use Avro rather than gzip for the archives.
Q3.A web-hosting company is building a web analytics tool to capture clickstream data from all of the websites
hosted within its platform and to provide near-real-time business intelligence. This entire system is built on
AWS services. The web-hosting company is interested in using Amazon Kinesis to collect this data and
perform sliding window analytics.
What is the most reliable and fault-tolerant technique to get each website to send data to Amazon Kinesis with
- A: After receiving a request, each web server sends it to Amazon Kinesis using the Amazon Kinesis
PutRecord API. Use the sessionID as a partition key and set up a loop to retry until a success response is
received.
- B: After receiving a request, each web server sends it to Amazon Kinesis using the Amazon Kinesis Producer
Library .addRecords method.
- C: Each web server buffers the requests until the count reaches 500 and sends them to Amazon Kinesis using
the Amazon Kinesis PutRecord API.
- D: After receiving a request, each web server sends it to Amazon Kinesis using the Amazon Kinesis
PutRecord API. Use the exponential back-off algorithm for retries until a successful response is received.
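
Option D refers to retrying with exponential back-off. A minimal sketch of that retry pattern follows. The `send` callable is a hypothetical stand-in for whatever function actually submits the record (it is not a real AWS SDK call), and the attempt limit, delays, and jitter strategy are assumptions.

```python
import random
import time


def send_with_backoff(send, record, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call send(record), retrying failures with capped exponential back-off.

    `send` is a hypothetical callable supplied by the caller. Real code
    would catch only retryable errors (for example throttling), not every
    exception as this sketch does.
    """
    for attempt in range(max_attempts):
        try:
            return send(record)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            # The delay ceiling doubles each attempt: 0.1, 0.2, 0.4, ...
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter spreads out retries from many web servers.
            sleep(random.uniform(0, ceiling))
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting; a caller would normally leave it at the default.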
Q4.A company has several teams of analysts. Each team of analysts has their own cluster. The teams need to run
SQL queries using Hive, Spark-SQL, and Presto with Amazon EMR. The company needs to enable a
centralized metadata layer to expose the Amazon S3 objects as tables to the analysts.
Which approach meets the requirement for a centralized metadata layer?
- A: EMRFS consistent view with a common Amazon DynamoDB table
- B: Bootstrap action to change the Hive Metastore to an Amazon RDS database
- C: s3distcp with the outputManifest option to generate RDS DDL
- D: Naming scheme support with automatic partition discovery from Amazon S3
Q5.An administrator needs to manage a large catalog of items from various external sellers. The administrator
needs to determine if the items should be identified as minimally dangerous, dangerous, or highly dangerous
based on their textual descriptions. The administrator already has some items with the danger attribute, but
receives hundreds of new item descriptions every day without such classification.
The administrator has a system that captures dangerous goods reports from customer support team of from
What is a cost-effective architecture to solve this issue?
- A: Build a set of regular expression rules that are based on the existing examples, and run them on the
DynamoDB Streams as every new item description is added to the system.
- B: Build a Kinesis Streams process that captures and marks the relevant items in the dangerous goods
reports using a Lambda function once more than two reports have been filed.
- C: Build a machine learning model to properly classify dangerous goods and run it on the DynamoDB Streams
as every new item description is added to the system.
- D: Build a machine learning model with binary classification for dangerous goods and run it on the DynamoDB
Streams as every new item description is added to the system.
Q6.A Redshift data warehouse has different user teams that need to query the same table with very different query
types. These user teams are experiencing poor performance.
Which action improves performance for the user teams in this situation?
- A: Create custom table views.
- B: Add interleaved sort keys per team.
- C: Maintain team-specific copies of the table.
- D: Add support for workload management queue hopping.
Q7.A company operates an international business served from a single AWS region. The company wants to
expand into a new country. The regulator for that country requires the Data Architect to maintain a log of
financial transactions in the country within 24 hours of the product transaction. The production application is
latency insensitive. The new country contains another AWS region.
What is the most cost-effective way to meet this requirement?
- A: Use CloudFormation to replicate the production application to the new region.
- B: Use Amazon CloudFront to serve application content locally in the country; Amazon CloudFront logs will
satisfy the requirement.
- C: Continue to serve customers from the existing region while using Amazon Kinesis to stream transaction
data to the regulator.
- D: Use Amazon S3 cross-region replication to copy and persist production transaction logs to a bucket in the
new country's region.
Q8.An administrator needs to design the event log storage architecture for events from mobile devices. The event
data will be processed by an Amazon EMR cluster daily for aggregated reporting and analytics before being
How should the administrator recommend storing the log data?
- A: Create an Amazon S3 bucket and write log data into folders by device. Execute the EMR job on the device
folder.
- B: Create an Amazon DynamoDB table partitioned on the device and sorted on date, write log data to table.
Execute the EMR job on the Amazon DynamoDB table.
- C: Create an Amazon S3 bucket and write data into folders by day. Execute the EMR job on the daily folder.
- D: Create an Amazon DynamoDB table partitioned on EventID, write log data to table. Execute the EMR job
on the table.
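
The by-day folder layout described in option C can be sketched as a key-naming helper. The `events/` root, the `YYYY/MM/DD` pattern, and the file naming are assumptions made for illustration; the question does not prescribe a specific scheme.

```python
from datetime import date, datetime, timezone


def daily_prefix(day):
    """Return the S3 key prefix that holds one day's events (assumed layout)."""
    return f"events/{day:%Y/%m/%d}/"


def daily_key(device_id, event_time, event_id):
    """Build an object key under the UTC day folder for a single event."""
    day = event_time.astimezone(timezone.utc).date()
    return f"{daily_prefix(day)}{device_id}-{event_id}.json"


# Hypothetical event from a hypothetical device.
key = daily_key("dev42", datetime(2024, 3, 5, 23, 30, tzinfo=timezone.utc), "e1")
print(key)
```

A daily job would then read only the objects under `daily_prefix(day)` for the day being processed, rather than scanning every device folder.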
Q9.A data engineer wants to use Amazon Elastic MapReduce (Amazon EMR) for an application. The data engineer needs to
make sure it complies with regulatory requirements. The auditor must be able to confirm at any point which
servers are running and which network access controls are deployed.
Which action should the data engineer take to meet this requirement?
- A: Provide the auditor with IAM accounts that have the SecurityAudit policy attached to their group.
- B: Provide the auditor with SSH keys for access to the Amazon EMR cluster.
- C: Provide the auditor with CloudFormation templates.
- D: Provide the auditor with access to AWS Direct Connect to use their existing tools.
Q10.A social media customer has data from different data sources including RDS running MySQL, Redshift, and
Hive on EMR. To support better analysis, the customer needs to be able to analyze data from different data
sources and to combine the results.
What is the most cost-effective solution to meet these requirements?
- A: Load all data from the different databases/warehouses to S3. Use Redshift COPY command to copy data to
Redshift for analysis.
- B: Install Presto on the EMR cluster where Hive sits. Configure the MySQL and PostgreSQL connectors to select
from different data sources in a single query.
- C: Spin up an Elasticsearch cluster. Load data from all three data sources and use Kibana to analyze.
- D: Write a program running on a separate EC2 instance to run queries to three different systems. Aggregate
the results after getting the responses from all three systems.