AWS Certified Data Engineer – Associate DEA-C01 Topic 3
Question #: 81
Topic #: 1
A data engineer creates an AWS Glue Data Catalog table by using an AWS Glue crawler that is named Orders. The data engineer wants to add the following new partitions:
s3://transactions/orders/order_date=2023-01-01
s3://transactions/orders/order_date=2023-01-02
The data engineer must edit the metadata to include the new partitions in the table without scanning all the folders and files in the location of the table.
Which data definition language (DDL) statement should the data engineer use in Amazon Athena?
A. ALTER TABLE Orders ADD PARTITION(order_date='2023-01-01') LOCATION 's3://transactions/orders/order_date=2023-01-01';
ALTER TABLE Orders ADD PARTITION(order_date='2023-01-02') LOCATION 's3://transactions/orders/order_date=2023-01-02';
B. MSCK REPAIR TABLE Orders;
C. REPAIR TABLE Orders;
D. ALTER TABLE Orders MODIFY PARTITION(order_date='2023-01-01') LOCATION 's3://transactions/orders/2023-01-01';
ALTER TABLE Orders MODIFY PARTITION(order_date='2023-01-02') LOCATION 's3://transactions/orders/2023-01-02';
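For reference, a minimal boto3 sketch that submits the ADD PARTITION DDL from option A through the Athena API; the database name and query result location are assumptions:

# Sketch: run the ADD PARTITION DDL through the Athena API.
# The database name and the query result location are assumptions.
import boto3

athena = boto3.client("athena")

ddl = """
ALTER TABLE Orders ADD IF NOT EXISTS
  PARTITION (order_date='2023-01-01') LOCATION 's3://transactions/orders/order_date=2023-01-01'
  PARTITION (order_date='2023-01-02') LOCATION 's3://transactions/orders/order_date=2023-01-02';
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "sales_db"},  # assumed database name
    ResultConfiguration={"OutputLocation": "s3://transactions/athena-results/"},  # assumed location
)

A single ALTER TABLE statement can add several partitions at once, and IF NOT EXISTS makes the statement safe to rerun.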
Question #: 82
Topic #: 1
A company stores 10 to 15 TB of uncompressed .csv files in Amazon S3. The company is evaluating Amazon Athena as a one-time query engine.
The company wants to transform the data to optimize query runtime and storage costs.
Which file format and compression solution will meet these requirements for Athena queries?
A. .csv format compressed with zip
B. JSON format compressed with bzip2
C. Apache Parquet format compressed with Snappy
D. Apache Avro format compressed with LZO
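For reference, a minimal sketch of the conversion toward option C using pandas with pyarrow; the file paths are assumptions, and s3fs is assumed to be installed for the s3:// paths:

# Sketch: rewrite an uncompressed .csv file as Snappy-compressed Parquet.
# Paths are assumptions; pandas, pyarrow, and s3fs must be installed.
import pandas as pd

df = pd.read_csv("s3://example-bucket/raw/orders.csv")   # assumed input path
df.to_parquet(
    "s3://example-bucket/curated/orders.parquet",        # assumed output path
    compression="snappy",                                # common Parquet default
)

Columnar Parquet lets Athena read only the columns a query references, which is where most of the scan savings come from.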
Question #: 83
Topic #: 1
A company uses Apache Airflow to orchestrate the company’s current on-premises data pipelines. The company runs SQL data quality check tasks as part of the pipelines. The company wants to migrate the pipelines to AWS and to use AWS managed services.
Which solution will meet these requirements with the LEAST amount of refactoring?
A. Set up AWS Outposts in the AWS Region that is nearest to the location where the company uses Airflow. Migrate the servers into Outposts-hosted Amazon EC2 instances. Update the pipelines to interact with the Outposts-hosted EC2 instances instead of the on-premises pipelines.
B. Create a custom Amazon Machine Image (AMI) that contains the Airflow application and the code that the company needs to migrate. Use the custom AMI to deploy Amazon EC2 instances. Update the network connections to interact with the newly deployed EC2 instances.
C. Migrate the existing Airflow orchestration configuration into Amazon Managed Workflows for Apache Airflow (Amazon MWAA). Create the data quality checks during the ingestion to validate the data quality by using SQL tasks in Airflow.
D. Convert the pipelines to AWS Step Functions workflows. Recreate the data quality checks in SQL as Python based AWS Lambda functions.
Question #: 84
Topic #: 1
A company uses Amazon EMR as an extract, transform, and load (ETL) pipeline to transform data that comes from multiple sources. A data engineer must orchestrate the pipeline to maximize performance.
Which AWS service will meet this requirement MOST cost effectively?
A. Amazon EventBridge
B. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
C. AWS Step Functions
D. AWS Glue Workflows
Question #: 85
Topic #: 1
An online retail company stores Application Load Balancer (ALB) access logs in an Amazon S3 bucket. The company wants to use Amazon Athena to query the logs to analyze traffic patterns.
A data engineer creates an unpartitioned table in Athena. As the amount of the data gradually increases, the response time for queries also increases. The data engineer wants to improve the query performance in Athena.
Which solution will meet these requirements with the LEAST operational effort?
A. Create an AWS Glue job that determines the schema of all ALB access logs and writes the partition metadata to the AWS Glue Data Catalog.
B. Create an AWS Glue crawler that includes a classifier that determines the schema of all ALB access logs and writes the partition metadata to the AWS Glue Data Catalog.
C. Create an AWS Lambda function to transform all ALB access logs. Save the results to Amazon S3 in Apache Parquet format. Partition the metadata. Use Athena to query the transformed data.
D. Use Apache Hive to create bucketed tables. Use an AWS Lambda function to transform all ALB access logs.
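For reference, a hedged boto3 sketch in the direction of option B; the crawler name, IAM role, database, S3 path, and schedule are all assumptions, and the ALB-specific grok classifier setup is omitted:

# Sketch: create and start a Glue crawler over the ALB access log prefix so
# partition metadata lands in the Data Catalog automatically.
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="alb-access-logs-crawler",                           # assumed name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",    # assumed role
    DatabaseName="weblogs",                                   # assumed database
    Targets={"S3Targets": [{"Path": "s3://example-bucket/alb-logs/"}]},  # assumed path
    Schedule="cron(0 1 * * ? *)",  # daily run picks up newly created partitions
)
glue.start_crawler(Name="alb-access-logs-crawler")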
Question #: 86
Topic #: 1
A company has a business intelligence platform on AWS. The company uses an Amazon S3 File Gateway (AWS Storage Gateway) to transfer files from the company’s on-premises environment to an Amazon S3 bucket.
A data engineer needs to set up a process that will automatically launch an AWS Glue workflow to run a series of AWS Glue jobs when each file transfer finishes successfully.
Which solution will meet these requirements with the LEAST operational overhead?
A. Determine when the file transfers usually finish based on previous successful file transfers. Set up an Amazon EventBridge scheduled event to initiate the AWS Glue jobs at that time of day.
B. Set up an Amazon EventBridge event that initiates the AWS Glue workflow after every successful S3 File Gateway file transfer event.
C. Set up an on-demand AWS Glue workflow so that the data engineer can start the AWS Glue workflow when each file transfer is complete.
D. Set up an AWS Lambda function that will invoke the AWS Glue Workflow. Set up an event for the creation of an S3 object as a trigger for the Lambda function.
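For reference, a hedged boto3 sketch of the event-driven wiring in option B; the workflow, job, rule, and role names are assumptions, and the exact detail-type string for S3 File Gateway upload notifications should be verified against the Storage Gateway documentation:

# Sketch: start a Glue workflow from EventBridge when a file upload finishes.
import boto3

events = boto3.client("events")
glue = boto3.client("glue")

# A Glue workflow needs an EVENT-type trigger before EventBridge can start it.
glue.create_trigger(
    Name="start-on-upload", Type="EVENT",
    WorkflowName="nightly-etl",                 # assumed workflow name
    Actions=[{"JobName": "load-job"}],          # assumed first job in the series
)

events.put_rule(
    Name="file-gateway-upload",
    EventPattern='{"source": ["aws.storagegateway"], '
                 '"detail-type": ["Storage Gateway File Upload Event"]}',  # assumed detail-type
)
events.put_targets(
    Rule="file-gateway-upload",
    Targets=[{
        "Id": "glue-workflow",
        "Arn": "arn:aws:glue:us-east-1:123456789012:workflow/nightly-etl",  # assumed ARN
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeGlueRole",    # assumed role
    }],
)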
Question #: 87
Topic #: 1
A retail company uses Amazon Aurora PostgreSQL to process and store live transactional data. The company uses an Amazon Redshift cluster for a data warehouse.
An extract, transform, and load (ETL) job runs every morning to update the Redshift cluster with new data from the PostgreSQL database. The company has grown rapidly and needs to cost optimize the Redshift cluster.
A data engineer needs to create a solution to archive historical data. The data engineer must be able to run analytics queries that effectively combine data from live transactional data in PostgreSQL, current data in Redshift, and archived historical data. The solution must keep only the most recent 15 months of data in Amazon Redshift to reduce costs.
Which combination of steps will meet these requirements? (Choose two.)
A. Configure the Amazon Redshift Federated Query feature to query live transactional data that is in the PostgreSQL database.
B. Configure Amazon Redshift Spectrum to query live transactional data that is in the PostgreSQL database.
C. Schedule a monthly job to copy data that is older than 15 months to Amazon S3 by using the UNLOAD command. Delete the old data from the Redshift cluster. Configure Amazon Redshift Spectrum to access historical data in Amazon S3.
D. Schedule a monthly job to copy data that is older than 15 months to Amazon S3 Glacier Flexible Retrieval by using the UNLOAD command. Delete the old data from the Redshift cluster. Configure Redshift Spectrum to access historical data from S3 Glacier Flexible Retrieval.
E. Create a materialized view in Amazon Redshift that combines live, current, and historical data from different sources.
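For reference, a hedged sketch of the archival step in option C using the Redshift Data API; the cluster, database, table, role, and bucket names are assumptions:

# Sketch: UNLOAD rows older than 15 months to S3, ready for Redshift Spectrum.
import boto3

rsd = boto3.client("redshift-data")
unload_sql = """
UNLOAD ('SELECT * FROM sales WHERE sale_date < DATEADD(month, -15, CURRENT_DATE)')
TO 's3://example-archive/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
FORMAT AS PARQUET;
"""
rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",  # assumed cluster
    Database="dev",                         # assumed database
    DbUser="admin",                         # assumed user
    Sql=unload_sql,
)

After the unload and delete, an external schema over s3://example-archive/ lets Redshift Spectrum query the archived rows alongside current data.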
Question #: 88
Topic #: 1
A manufacturing company has many IoT devices in facilities around the world. The company uses Amazon Kinesis Data Streams to collect data from the devices. The data includes device ID, capture date, measurement type, measurement value, and facility ID. The company uses facility ID as the partition key.
The company’s operations team recently observed many WriteThroughputExceeded exceptions. The operations team found that some shards were heavily used but other shards were generally idle.
How should the company resolve the issues that the operations team observed?
A. Change the partition key from facility ID to a randomly generated key.
B. Increase the number of shards.
C. Archive the data on the producer’s side.
D. Change the partition key from facility ID to capture date.
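For reference, a minimal producer sketch matching option A; the stream name and record layout are assumptions:

# Sketch: replace the low-cardinality facility ID partition key with a
# randomly generated key so records spread evenly across shards.
import json
import uuid
import boto3

kinesis = boto3.client("kinesis")
record = {"device_id": "sensor-0042", "facility_id": "FAC-7",   # assumed layout
          "measurement_type": "temperature", "measurement_value": 21.4}

kinesis.put_record(
    StreamName="iot-measurements",       # assumed stream name
    Data=json.dumps(record).encode(),
    PartitionKey=str(uuid.uuid4()),      # random key avoids hot shards
)

Note that a random key trades away per-facility ordering, which is acceptable when the consumers do not depend on ordering by facility.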
Question #: 89
Topic #: 1
A data engineer wants to improve the performance of SQL queries in Amazon Athena that run against a sales data table.
The data engineer wants to understand the execution plan of a specific SQL statement. The data engineer also wants to see the computational cost of each operation in a SQL query.
Which statement does the data engineer need to run to meet these requirements?
A. EXPLAIN SELECT * FROM sales;
B. EXPLAIN ANALYZE FROM sales;
C. EXPLAIN ANALYZE SELECT * FROM sales;
D. EXPLAIN FROM sales;
Question #: 90
Topic #: 1
A company plans to provision a log delivery stream within a VPC. The company configured the VPC flow logs to publish to Amazon CloudWatch Logs. The company needs to send the flow logs to Splunk in near real time for further analysis.
Which solution will meet these requirements with the LEAST operational overhead?
A. Configure an Amazon Kinesis Data Streams data stream to use Splunk as the destination. Create a CloudWatch Logs subscription filter to send log events to the data stream.
B. Create an Amazon Kinesis Data Firehose delivery stream to use Splunk as the destination. Create a CloudWatch Logs subscription filter to send log events to the delivery stream.
C. Create an Amazon Kinesis Data Firehose delivery stream to use Splunk as the destination. Create an AWS Lambda function to send the flow logs from CloudWatch Logs to the delivery stream.
D. Configure an Amazon Kinesis Data Streams data stream to use Splunk as the destination. Create an AWS Lambda function to send the flow logs from CloudWatch Logs to the data stream.
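For reference, a minimal boto3 sketch of the subscription filter in option B; the log group, delivery stream ARN, and role are assumptions:

# Sketch: stream every flow log event from CloudWatch Logs to an existing
# Firehose delivery stream that has Splunk as its destination.
import boto3

logs = boto3.client("logs")
logs.put_subscription_filter(
    logGroupName="/vpc/flow-logs",   # assumed log group
    filterName="to-splunk-firehose",
    filterPattern="",                # empty pattern forwards every event
    destinationArn="arn:aws:firehose:us-east-1:123456789012:deliverystream/splunk-stream",  # assumed
    roleArn="arn:aws:iam::123456789012:role/CWLtoFirehoseRole",  # assumed role
)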
Question #: 91
Topic #: 1
A company has a data lake on AWS. The data lake ingests sources of data from business units. The company uses Amazon Athena for queries. The storage layer is Amazon S3 with an AWS Glue Data Catalog as a metadata repository.
The company wants to make the data available to data scientists and business analysts. However, the company first needs to manage fine-grained, column-level data access for Athena based on the user roles and responsibilities.
Which solution will meet these requirements?
A. Set up AWS Lake Formation. Define security policy-based rules for the users and applications by IAM role in Lake Formation.
B. Define an IAM resource-based policy for AWS Glue tables. Attach the same policy to IAM user groups.
C. Define an IAM identity-based policy for AWS Glue tables. Attach the same policy to IAM roles. Associate the IAM roles with IAM groups that contain the users.
D. Create a resource share in AWS Resource Access Manager (AWS RAM) to grant access to IAM users.
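For reference, a hedged boto3 sketch of a column-level grant in the spirit of option A; the role ARN, database, table, and column names are assumptions:

# Sketch: grant an analyst role SELECT on specific columns through
# Lake Formation, so Athena enforces column-level access.
import boto3

lf = boto3.client("lakeformation")
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier":
               "arn:aws:iam::123456789012:role/AnalystRole"},   # assumed role
    Resource={"TableWithColumns": {
        "DatabaseName": "sales_db",                             # assumed database
        "Name": "orders",                                       # assumed table
        "ColumnNames": ["order_id", "order_date", "total"],     # assumed non-sensitive columns
    }},
    Permissions=["SELECT"],
)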
Question #: 92
Topic #: 1
A company has developed several AWS Glue extract, transform, and load (ETL) jobs to validate and transform data from Amazon S3. The ETL jobs load the data into Amazon RDS for MySQL in batches once every day. The ETL jobs use a DynamicFrame to read the S3 data.
The ETL jobs currently process all the data that is in the S3 bucket. However, the company wants the jobs to process only the daily incremental data.
Which solution will meet this requirement with the LEAST coding effort?
A. Create an ETL job that reads the S3 file status and logs the status in Amazon DynamoDB.
B. Enable job bookmarks for the ETL jobs to update the state after a run to keep track of previously processed data.
C. Enable job metrics for the ETL jobs to help keep track of processed objects in Amazon CloudWatch.
D. Configure the ETL jobs to delete processed objects from Amazon S3 after each run.
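For reference, a minimal boto3 sketch of option B; the job name, role, and script location are assumptions:

# Sketch: create a Glue job with bookmarks enabled so each daily run reads
# only S3 objects that earlier runs have not processed.
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="daily-incremental-load",                                  # assumed name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",              # assumed role
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://example-bucket/scripts/load.py"},  # assumed script
    DefaultArguments={"--job-bookmark-option": "job-bookmark-enable"},
)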
Question #: 93
Topic #: 1
An online retail company has an application that runs on Amazon EC2 instances that are in a VPC. The company wants to collect flow logs for the VPC and analyze network traffic.
Which solution will meet these requirements MOST cost-effectively?
A. Publish flow logs to Amazon CloudWatch Logs. Use Amazon Athena for analytics.
B. Publish flow logs to Amazon CloudWatch Logs. Use an Amazon OpenSearch Service cluster for analytics.
C. Publish flow logs to Amazon S3 in text format. Use Amazon Athena for analytics.
D. Publish flow logs to Amazon S3 in Apache Parquet format. Use Amazon Athena for analytics.
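For reference, a minimal boto3 sketch of option D; the VPC ID and bucket ARN are assumptions:

# Sketch: publish VPC flow logs directly to S3 in Parquet, which Athena
# scans far more cheaply than plain text.
import boto3

ec2 = boto3.client("ec2")
ec2.create_flow_logs(
    ResourceIds=["vpc-0abc123"],                       # assumed VPC ID
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::example-flow-logs",   # assumed bucket ARN
    DestinationOptions={
        "FileFormat": "parquet",
        "HiveCompatiblePartitions": True,
        "PerHourPartition": True,
    },
)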
Question #: 94
Topic #: 1
A retail company stores transactions, store locations, and customer information tables across four reserved ra3.4xlarge Amazon Redshift cluster nodes. All three tables use EVEN table distribution.
The company updates the store location table only once or twice every few years.
A data engineer notices that Redshift queries are slowing down because the whole store location table is constantly being broadcast to all four compute nodes for most queries. The data engineer wants to speed up the query performance by minimizing the broadcasting of the store location table.
Which solution will meet these requirements in the MOST cost-effective way?
A. Change the distribution style of the store location table from EVEN distribution to ALL distribution.
B. Change the distribution style of the store location table to KEY distribution based on the column that has the highest dimension.
C. Add a join column named store_id into the sort key for all the tables.
D. Upgrade the Redshift reserved node to a larger instance size in the same instance family.
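For reference, a minimal sketch of option A through the Redshift Data API; the cluster and table identifiers are assumptions:

# Sketch: give the small, rarely updated dimension table ALL distribution so
# a full copy sits on every node and broadcasts stop.
import boto3

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="retail-cluster",   # assumed cluster
    Database="dev",                       # assumed database
    DbUser="admin",                       # assumed user
    Sql="ALTER TABLE store_location ALTER DISTSTYLE ALL;",
)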
Question #: 95
Topic #: 1
A company has a data warehouse that contains a table that is named Sales. The company stores the table in Amazon Redshift. The table includes a column that is named city_name. The company wants to query the table to find all rows that have a city_name that starts with “San” or “El”.
Which SQL query will meet this requirement?
A. Select * from Sales where city_name ~ '$(San|El)*';
B. Select * from Sales where city_name ~ '^(San|El)*';
C. Select * from Sales where city_name ~ '$(San&El)*';
D. Select * from Sales where city_name ~ '^(San&El)*';
Question #: 96
Topic #: 1
A company needs to send customer call data from its on-premises PostgreSQL database to AWS to generate near real-time insights. The solution must capture and load updates from operational data stores that run in the PostgreSQL database. The data changes continuously.
A data engineer configures an AWS Database Migration Service (AWS DMS) ongoing replication task. The task reads changes in near real time from the PostgreSQL source database transaction logs for each table. The task then sends the data to an Amazon Redshift cluster for processing.
The data engineer discovers latency issues during the change data capture (CDC) of the task. The data engineer thinks that the PostgreSQL source database is causing the high latency.
Which solution will confirm that the PostgreSQL database is the source of the high latency?
A. Use Amazon CloudWatch to monitor the DMS task. Examine the CDCIncomingChanges metric to identify delays in the CDC from the source database.
B. Verify that logical replication of the source database is configured in the postgresql.conf configuration file.
C. Enable Amazon CloudWatch Logs for the DMS endpoint of the source database. Check for error messages.
D. Use Amazon CloudWatch to monitor the DMS task. Examine the CDCLatencySource metric to identify delays in the CDC from the source database.
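For reference, a minimal boto3 sketch of reading the metric in option D; the instance and task identifiers are assumptions:

# Sketch: pull CDCLatencySource for the DMS task. A climbing value indicates
# the delay originates at the PostgreSQL source.
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")
resp = cw.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="CDCLatencySource",
    Dimensions=[
        {"Name": "ReplicationInstanceIdentifier", "Value": "dms-instance-1"},   # assumed
        {"Name": "ReplicationTaskIdentifier", "Value": "pg-to-redshift-cdc"},   # assumed
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
    print(point["Timestamp"], point["Average"])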
Question #: 97
Topic #: 1
A lab uses IoT sensors to monitor humidity, temperature, and pressure for a project. The sensors send 100 KB of data every 10 seconds. A downstream process will read the data from an Amazon S3 bucket every 30 seconds.
Which solution will deliver the data to the S3 bucket with the LEAST latency?
A. Use Amazon Kinesis Data Streams and Amazon Kinesis Data Firehose to deliver the data to the S3 bucket. Use the default buffer interval for Kinesis Data Firehose.
B. Use Amazon Kinesis Data Streams to deliver the data to the S3 bucket. Configure the stream to use 5 provisioned shards.
C. Use Amazon Kinesis Data Streams and call the Kinesis Client Library to deliver the data to the S3 bucket. Use a 5 second buffer interval from an application.
D. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) and Amazon Kinesis Data Firehose to deliver the data to the S3 bucket. Use a 5 second buffer interval for Kinesis Data Firehose.
Question #: 98
Topic #: 1
A company wants to use machine learning (ML) to perform analytics on data that is in an Amazon S3 data lake. The company has two data transformation requirements that will give consumers within the company the ability to create reports.
The company must perform daily transformations on 300 GB of data that is in a variety of formats and that arrives in Amazon S3 at a scheduled time. The company must perform one-time transformations of terabytes of archived data that is in the S3 data lake. The company uses Amazon Managed Workflows for Apache Airflow (Amazon MWAA) Directed Acyclic Graphs (DAGs) to orchestrate processing.
Which combination of tasks should the company schedule in the Amazon MWAA DAGs to meet these requirements MOST cost-effectively? (Choose two.)
A. For daily incoming data, use AWS Glue crawlers to scan and identify the schema.
B. For daily incoming data, use Amazon Athena to scan and identify the schema.
C. For daily incoming data, use Amazon Redshift to perform transformations.
D. For daily and archived data, use Amazon EMR to perform data transformations.
E. For archived data, use Amazon SageMaker to perform data transformations.
Question #: 99
Topic #: 1
A retail company uses AWS Glue for extract, transform, and load (ETL) operations on a dataset that contains information about customer orders. The company wants to implement specific validation rules to ensure data accuracy and consistency.
Which solution will meet these requirements?
A. Use AWS Glue job bookmarks to track the data for accuracy and consistency.
B. Create custom AWS Glue Data Quality rulesets to define specific data quality checks.
C. Use the built-in AWS Glue Data Quality transforms for standard data quality validations.
D. Use AWS Glue Data Catalog to maintain a centralized data schema and metadata repository.
Question #: 100
Topic #: 1
An insurance company stores transaction data that the company compressed with gzip.
The company needs to query the transaction data for occasional audits.
Which solution will meet this requirement in the MOST cost-effective way?
A. Store the data in Amazon S3 Glacier Flexible Retrieval. Use Amazon S3 Glacier Select to query the data.
B. Store the data in Amazon S3. Use Amazon S3 Select to query the data.
C. Store the data in Amazon S3. Use Amazon Athena to query the data.
D. Store the data in Amazon S3 Glacier Instant Retrieval. Use Amazon Athena to query the data.
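For reference, a hedged boto3 sketch of option B; the bucket, key, and column name are assumptions:

# Sketch: query a gzip-compressed CSV object in place with S3 Select,
# paying only for the bytes scanned during the occasional audit.
import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="example-transactions",            # assumed bucket
    Key="2023/transactions.csv.gz",           # assumed key
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.transaction_id = 'TX-1001'",  # assumed column
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())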
Question #: 101
Topic #: 1
A data engineer finished testing an Amazon Redshift stored procedure that processes and inserts data into a table that is not mission critical. The engineer wants to automatically run the stored procedure on a daily basis.
Which solution will meet this requirement in the MOST cost-effective way?
A. Create an AWS Lambda function to schedule a cron job to run the stored procedure.
B. Schedule and run the stored procedure by using the Amazon Redshift Data API in an Amazon EC2 Spot Instance.
C. Use query editor v2 to run the stored procedure on a schedule.
D. Schedule an AWS Glue Python shell job to run the stored procedure.
Question #: 102
Topic #: 1
A marketing company collects clickstream data. The company sends the clickstream data to Amazon Kinesis Data Firehose and stores the clickstream data in Amazon S3. The company wants to build a series of dashboards that hundreds of users from multiple departments will use.
The company will use Amazon QuickSight to develop the dashboards. The company wants a solution that can scale and provide daily updates about clickstream activity.
Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)
A. Use Amazon Redshift to store and query the clickstream data.
B. Use Amazon Athena to query the clickstream data.
C. Use Amazon S3 analytics to query the clickstream data.
D. Access the query data through a QuickSight direct SQL query.
E. Access the query data through QuickSight SPICE (Super-fast, Parallel, In-memory Calculation Engine). Configure a daily refresh for the dataset.
Question #: 103
Topic #: 1
A data engineer is building a data orchestration workflow. The data engineer plans to use a hybrid model that includes some on-premises resources and some resources that are in the cloud. The data engineer wants to prioritize portability and open source resources.
Which service should the data engineer use in both the on-premises environment and the cloud-based environment?
A. AWS Data Exchange
B. Amazon Simple Workflow Service (Amazon SWF)
C. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
D. AWS Glue
Question #: 104
Topic #: 1
A gaming company uses a NoSQL database to store customer information. The company is planning to migrate to AWS.
The company needs a fully managed AWS solution that will handle high online transaction processing (OLTP) workload, provide single-digit millisecond performance, and provide high availability around the world.
Which solution will meet these requirements with the LEAST operational overhead?
A. Amazon Keyspaces (for Apache Cassandra)
B. Amazon DocumentDB (with MongoDB compatibility)
C. Amazon DynamoDB
D. Amazon Timestream
Question #: 105
Topic #: 1
A data engineer creates an AWS Lambda function that an Amazon EventBridge event will invoke. When the data engineer tries to invoke the Lambda function by using an EventBridge event, an AccessDeniedException message appears.
How should the data engineer resolve the exception?
A. Ensure that the trust policy of the Lambda function execution role allows EventBridge to assume the execution role.
B. Ensure that both the IAM role that EventBridge uses and the Lambda function’s resource-based policy have the necessary permissions.
C. Ensure that the subnet where the Lambda function is deployed is configured to be a private subnet.
D. Ensure that EventBridge schemas are valid and that the event mapping configuration is correct.
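For reference, a minimal boto3 sketch of the resource-based policy half of option B; the function name and rule ARN are assumptions:

# Sketch: allow the EventBridge rule to invoke the Lambda function by adding
# a statement to the function's resource-based policy.
import boto3

boto3.client("lambda").add_permission(
    FunctionName="process-events",      # assumed function name
    StatementId="allow-eventbridge",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn="arn:aws:events:us-east-1:123456789012:rule/nightly-rule",  # assumed rule ARN
)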
Question #: 106
Topic #: 1
A company uses a data lake that is based on an Amazon S3 bucket. To comply with regulations, the company must apply two layers of server-side encryption to files that are uploaded to the S3 bucket. The company wants to use an AWS Lambda function to apply the necessary encryption.
Which solution will meet these requirements?
A. Use both server-side encryption with AWS KMS keys (SSE-KMS) and the Amazon S3 Encryption Client.
B. Use dual-layer server-side encryption with AWS KMS keys (DSSE-KMS).
C. Use server-side encryption with customer-provided keys (SSE-C) before files are uploaded.
D. Use server-side encryption with AWS KMS keys (SSE-KMS).
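For reference, a minimal sketch of option B from inside a Lambda handler; the bucket, key, and KMS alias are assumptions:

# Sketch: upload an object with dual-layer server-side encryption (DSSE-KMS),
# which applies two independent layers of encryption in one call.
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    s3.put_object(
        Bucket="example-data-lake",            # assumed bucket
        Key="incoming/file.csv",               # assumed key
        Body=b"...",                           # file contents
        ServerSideEncryption="aws:kms:dsse",   # dual-layer SSE with KMS keys
        SSEKMSKeyId="alias/data-lake-key",     # assumed KMS key alias
    )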
Question #: 107
Topic #: 1
A data engineer notices that Amazon Athena queries are held in a queue before the queries run.
How can the data engineer prevent the queries from queueing?
A. Increase the query result limit.
B. Configure provisioned capacity for an existing workgroup.
C. Use federated queries.
D. Add the users who run the Athena queries to an existing workgroup.
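For reference, a hedged boto3 sketch of option B; the reservation and workgroup names are assumptions:

# Sketch: create provisioned capacity and assign it to the workgroup so its
# queries run on dedicated DPUs instead of waiting in the shared queue.
import boto3

athena = boto3.client("athena")
athena.create_capacity_reservation(
    Name="analytics-capacity",   # assumed reservation name
    TargetDpus=24,               # 24 DPUs is the minimum reservation size
)
athena.put_capacity_assignment_configuration(
    CapacityReservationName="analytics-capacity",
    CapacityAssignments=[{"WorkGroupNames": ["primary"]}],  # assumed workgroup
)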
Question #: 108
Topic #: 1
A data engineer needs to debug an AWS Glue job that reads from Amazon S3 and writes to Amazon Redshift. The data engineer enabled the bookmark feature for the AWS Glue job.
The data engineer has set the maximum concurrency for the AWS Glue job to 1.
The AWS Glue job is successfully writing the output to Amazon Redshift. However, the Amazon S3 files that were loaded during previous runs of the AWS Glue job are being reprocessed by subsequent runs.
What is the likely reason the AWS Glue job is reprocessing the files?
A. The AWS Glue job does not have the s3:GetObjectAcl permission that is required for bookmarks to work correctly.
B. The maximum concurrency for the AWS Glue job is set to 1.
C. The data engineer incorrectly specified an older version of AWS Glue for the Glue job.
D. The AWS Glue job does not have a required commit statement.
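For reference, the standard Glue script skeleton that makes bookmarks work, matching option D; the transform section is elided:

# Sketch: bookmark-aware Glue ETL script. Without the final job.commit(),
# the bookmark state is never saved and every run reprocesses old S3 files.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# ... read from S3 with a DynamicFrame, transform, write to Amazon Redshift ...

job.commit()  # persists the bookmark so the next run skips processed files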
Question #: 109
Topic #: 1
An ecommerce company wants to use AWS to migrate data pipelines from an on-premises environment into the AWS Cloud. The company currently uses a third-party tool in the on-premises environment to orchestrate data ingestion processes.
The company wants a migration solution that does not require the company to manage servers. The solution must be able to orchestrate Python and Bash scripts. The solution must not require the company to refactor any code.
Which solution will meet these requirements with the LEAST operational overhead?
A. AWS Lambda
B. Amazon Managed Workflows for Apache Airflow (Amazon MWAA)
C. AWS Step Functions
D. AWS Glue
Question #: 110
Topic #: 1
A retail company stores data from a product lifecycle management (PLM) application in an on-premises MySQL database. The PLM application frequently updates the database when transactions occur.
The company wants to gather insights from the PLM application in near real time. The company wants to integrate the insights with other business datasets and to analyze the combined dataset by using an Amazon Redshift data warehouse.
The company has already established an AWS Direct Connect connection between the on-premises infrastructure and AWS.
Which solution will meet these requirements with the LEAST development effort?
A. Run a scheduled AWS Glue extract, transform, and load (ETL) job to get the MySQL database updates by using a Java Database Connectivity (JDBC) connection. Set Amazon Redshift as the destination for the ETL job.
B. Run a full load plus CDC task in AWS Database Migration Service (AWS DMS) to continuously replicate the MySQL database changes. Set Amazon Redshift as the destination for the task.
C. Use the Amazon AppFlow SDK to build a custom connector for the MySQL database to continuously replicate the database changes. Set Amazon Redshift as the destination for the connector.
D. Run scheduled AWS DataSync tasks to synchronize data from the MySQL database. Set Amazon Redshift as the destination for the tasks.
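For reference, a hedged boto3 sketch of the task in option B; the endpoint and instance ARNs are assumptions and the table mapping is trimmed to a single rule:

# Sketch: a full-load-plus-CDC DMS task that continuously replicates MySQL
# changes into Amazon Redshift.
import json

import boto3

boto3.client("dms").create_replication_task(
    ReplicationTaskIdentifier="plm-to-redshift",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:mysql-src",      # assumed
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:redshift-tgt",   # assumed
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:instance-1",     # assumed
    MigrationType="full-load-and-cdc",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection", "rule-id": "1", "rule-name": "plm-tables",
            "object-locator": {"schema-name": "plm", "table-name": "%"},   # assumed schema
            "rule-action": "include",
        }]
    }),
)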
Question #: 111
Topic #: 1
A marketing company uses Amazon S3 to store clickstream data. The company queries the data at the end of each day by using a SQL JOIN clause on S3 objects that are stored in separate buckets.
The company creates key performance indicators (KPIs) based on the objects. The company needs a serverless solution that will give users the ability to query data by partitioning the data. The solution must maintain the atomicity, consistency, isolation, and durability (ACID) properties of the data.
Which solution will meet these requirements MOST cost-effectively?
A. Amazon S3 Select
B. Amazon Redshift Spectrum
C. Amazon Athena
D. Amazon EMR
Question #: 112
Topic #: 1
A company wants to migrate data from an Amazon RDS for PostgreSQL DB instance in the eu-east-1 Region of an AWS account named Account_A. The company will migrate the data to an Amazon Redshift cluster in the eu-west-1 Region of an AWS account named Account_B.
Which solution will give AWS Database Migration Service (AWS DMS) the ability to replicate data between two data stores?
A. Set up an AWS DMS replication instance in Account_B in eu-west-1.
B. Set up an AWS DMS replication instance in Account_B in eu-east-1.
C. Set up an AWS DMS replication instance in a new AWS account in eu-west-1.
D. Set up an AWS DMS replication instance in Account_A in eu-east-1.
Question #: 113
Topic #: 1
A company uses Amazon S3 as a data lake. The company sets up a data warehouse by using a multi-node Amazon Redshift cluster. The company organizes the data files in the data lake based on the data source of each data file.
The company loads all the data files into one table in the Redshift cluster by using a separate COPY command for each data file location. This approach takes a long time to load all the data files into the table. The company must increase the speed of the data ingestion. The company does not want to increase the cost of the process.
Which solution will meet these requirements?
A. Use a provisioned Amazon EMR cluster to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift.
B. Load all the data files in parallel into Amazon Aurora. Run an AWS Glue job to load the data into Amazon Redshift.
C. Use an AWS Glue job to copy all the data files into one folder. Use a COPY command to load the data into Amazon Redshift.
D. Create a manifest file that contains the data file locations. Use a COPY command to load the data into Amazon Redshift.
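For reference, a hedged sketch of option D; bucket names, the target table, and the IAM role are assumptions:

# Sketch: write a manifest listing every data file location, then load them
# all in parallel with a single COPY.
import json

import boto3

manifest = {
    "entries": [
        {"url": "s3://example-lake/source-a/part-0000.csv", "mandatory": True},
        {"url": "s3://example-lake/source-b/part-0000.csv", "mandatory": True},
        # ... one entry per data file location ...
    ]
}
boto3.client("s3").put_object(
    Bucket="example-lake", Key="manifests/load.manifest",
    Body=json.dumps(manifest).encode(),
)

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="analytics-cluster", Database="dev", DbUser="admin",  # assumed
    Sql="""
    COPY sales FROM 's3://example-lake/manifests/load.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV MANIFEST;
    """,
)

One COPY against a manifest lets Redshift split the files across all slices, which is faster than a COPY per location and adds no extra cost.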
Question #: 114
Topic #: 1
A company plans to use Amazon Kinesis Data Firehose to store data in Amazon S3. The source data consists of 2 MB .csv files. The company must convert the .csv files to JSON format. The company must store the files in Apache Parquet format.
Which solution will meet these requirements with the LEAST development effort?
A. Use Kinesis Data Firehose to convert the .csv files to JSON. Use an AWS Lambda function to store the files in Parquet format.
B. Use Kinesis Data Firehose to convert the .csv files to JSON and to store the files in Parquet format.
C. Use Kinesis Data Firehose to invoke an AWS Lambda function that transforms the .csv files to JSON and stores the files in Parquet format.
D. Use Kinesis Data Firehose to invoke an AWS Lambda function that transforms the .csv files to JSON. Use Kinesis Data Firehose to store the files in Parquet format.
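For reference, a hedged sketch of the Lambda transform half of option D; the CSV column layout is an assumption, each record is treated as a single CSV line for brevity, and Firehose's built-in record format conversion then writes the JSON records out as Parquet:

# Sketch: Firehose transformation Lambda that rewrites each incoming .csv
# record as JSON, following the Firehose data-transformation contract.
import base64
import json

COLUMNS = ["order_id", "customer_id", "amount"]   # assumed CSV layout

def handler(event, context):
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode().strip()
        as_json = dict(zip(COLUMNS, line.split(",")))
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(as_json) + "\n").encode()).decode(),
        })
    return {"records": output}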
Question #: 115
Topic #: 1
A company is using an AWS Transfer Family server to migrate data from an on-premises environment to AWS. Company policy mandates the use of TLS 1.2 or above to encrypt the data in transit.
Which solution will meet these requirements?
A. Generate new SSH keys for the Transfer Family server. Make the old keys and the new keys available for use.
B. Update the security group rules for the on-premises network to allow only connections that use TLS 1.2 or above.
C. Update the security policy of the Transfer Family server to specify a minimum protocol version of TLS 1.2.
D. Install an SSL certificate on the Transfer Family server to encrypt data transfers by using TLS 1.2.
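For reference, a minimal boto3 sketch of option C; the server ID is an assumption:

# Sketch: pin the Transfer Family server to a security policy that enforces
# a minimum protocol version of TLS 1.2.
import boto3

boto3.client("transfer").update_server(
    ServerId="s-1234567890abcdef0",                       # assumed server ID
    SecurityPolicyName="TransferSecurityPolicy-2020-06",  # enforces TLS 1.2+
)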
Question #: 116
Topic #: 1
A company wants to migrate an application and an on-premises Apache Kafka server to AWS. The application processes incremental updates that an on-premises Oracle database sends to the Kafka server. The company wants to use the replatform migration strategy instead of the refactor strategy.
Which solution will meet these requirements with the LEAST management overhead?
A. Amazon Kinesis Data Streams
B. Amazon Managed Streaming for Apache Kafka (Amazon MSK) provisioned cluster
C. Amazon Kinesis Data Firehose
D. Amazon Managed Streaming for Apache Kafka (Amazon MSK) Serverless
Question #: 117
Topic #: 1
A data engineer is building an automated extract, transform, and load (ETL) ingestion pipeline by using AWS Glue. The pipeline ingests compressed files that are in an Amazon S3 bucket. The ingestion pipeline must support incremental data processing.
Which AWS Glue feature should the data engineer use to meet this requirement?
A. Workflows
B. Triggers
C. Job bookmarks
D. Classifiers
Question #: 118
Topic #: 1
A banking company uses an application to collect large volumes of transactional data. The company uses Amazon Kinesis Data Streams for real-time analytics. The company’s application uses the PutRecord action to send data to Kinesis Data Streams.
A data engineer has observed network outages during certain times of day. The data engineer wants to configure exactly-once delivery for the entire processing pipeline.
Which solution will meet this requirement?
A. Design the application so it can remove duplicates during processing by embedding a unique ID in each record at the source.
B. Update the checkpoint configuration of the Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) data collection application to avoid duplicate processing of events.
C. Design the data source so events are not ingested into Kinesis Data Streams multiple times.
D. Stop using Kinesis Data Streams. Use Amazon EMR instead. Use Apache Flink and Apache Spark Streaming in Amazon EMR.
Question #: 119
Topic #: 1
A company stores logs in an Amazon S3 bucket. When a data engineer attempts to access several log files, the data engineer discovers that some files have been unintentionally deleted.
The data engineer needs a solution that will prevent unintentional file deletion in the future.
Which solution will meet this requirement with the LEAST operational overhead?
A. Manually back up the S3 bucket on a regular basis.
B. Enable S3 Versioning for the S3 bucket.
C. Configure replication for the S3 bucket.
D. Use an Amazon S3 Glacier storage class to archive the data that is in the S3 bucket.
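For reference, a minimal boto3 sketch of option B; the bucket name is an assumption:

# Sketch: enable S3 Versioning so a delete leaves a recoverable delete marker
# instead of destroying the log file.
import boto3

boto3.client("s3").put_bucket_versioning(
    Bucket="example-log-bucket",   # assumed bucket
    VersioningConfiguration={"Status": "Enabled"},
)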
Question #: 120
Topic #: 1
A telecommunications company collects network usage data throughout each day at a rate of several thousand data points each second. The company runs an application to process the usage data in real time. The company aggregates and stores the data in an Amazon Aurora DB instance.
Sudden drops in network usage usually indicate a network outage. The company must be able to identify sudden drops in network usage so the company can take immediate remedial actions.
Which solution will meet this requirement with the LEAST latency?
A. Create an AWS Lambda function to query Aurora for drops in network usage. Use Amazon EventBridge to automatically invoke the Lambda function every minute.
B. Modify the processing application to publish the data to an Amazon Kinesis data stream. Create an Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) application to detect drops in network usage.
C. Replace the Aurora database with an Amazon DynamoDB table. Create an AWS Lambda function to query the DynamoDB table for drops in network usage every minute. Use DynamoDB Accelerator (DAX) between the processing application and DynamoDB table.
D. Create an AWS Lambda function within the Database Activity Streams feature of Aurora to detect drops in network usage.