CCA175 Exam Free Question Set: "Cloudera CCA Spark and Hadoop Developer Certification"

CORRECT TEXT
Problem Scenario 95 : You have to run your Spark application on YARN with a maximum heap size of 512 MB for each executor and 1 processor core allocated to each executor. Your main application requires three values as input arguments: V1 V2 V3.
Please replace XXX, YYY, ZZZ
./bin/spark-submit --class com.hadoopexam.MyTask --master yarn-cluster --num-executors 3 --driver-memory 512m XXX YYY lib/hadoopexam.jar ZZZ
Correct answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution
XXX : --executor-memory 512m
YYY : --executor-cores 1
ZZZ : V1 V2 V3
Notes : spark-submit on YARN options
Option : Description
archives : Comma-separated list of archives to be extracted into the working directory of each executor. The path must be globally visible inside your cluster; see Advanced Dependency Management.
executor-cores : Number of processor cores to allocate on each executor. Alternatively, you can use the spark.executor.cores property.
executor-memory : Maximum heap size to allocate to each executor. Alternatively, you can use the spark.executor.memory property.
num-executors : Total number of YARN containers to allocate for this application. Alternatively, you can use the spark.executor.instances property.
queue : YARN queue to submit to. For more information, see Assigning Applications and Queries to Resource Pools. Default: default.
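A minimal Python sketch of how the completed command line fits together. The helper name build_submit_command is hypothetical and only illustrates the ordering rule: spark-submit options come before the application JAR, and the application's own arguments come after it.

```python
import shlex

def build_submit_command(main_class, jar, app_args,
                         executor_memory="512m", executor_cores=1,
                         num_executors=3, driver_memory="512m"):
    """Assemble a spark-submit argument list for a YARN cluster-mode run."""
    argv = [
        "./bin/spark-submit",
        "--class", main_class,
        "--master", "yarn-cluster",
        "--num-executors", str(num_executors),
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,      # XXX in the question
        "--executor-cores", str(executor_cores),   # YYY in the question
        jar,
    ]
    # Application arguments (ZZZ) always follow the application JAR.
    return argv + list(app_args)

command = build_submit_command(
    "com.hadoopexam.MyTask", "lib/hadoopexam.jar", ["V1", "V2", "V3"]
)
print(shlex.join(command))
```

Building the command as a list keeps each option and its value as separate tokens, which avoids the fused tokens (such as a JAR path run together with the first argument) that break the command.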
CORRECT TEXT
Problem Scenario 71 :
Write a Spark script in Python that does the following:
Read the file "Content.txt" (on HDFS), which has the content shown below.
Split each row into (key, value), where the key is the first word in the line and the value is the entire line.
Filter out the empty lines.
Save these key-value pairs in "problem86" as a sequence file (on HDFS).
Part 2 : Save as a sequence file where the key is null and the value is the entire line. Read back the stored sequence file.
Content.txt
Hello this is ABCTECH.com
This is XYZTECH.com
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce
Correct answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 :
# Import SparkContext and SparkConf
from pyspark import SparkContext, SparkConf
Step 2:
# Load data from HDFS
contentRDD = sc.textFile("Content.txt")
Step 3:
# Keep only the non-empty lines
nonempty_lines = contentRDD.filter(lambda x: len(x) > 0)
Step 4:
# Split each line on the first space (Remember : it is mandatory to convert the result to a tuple)
words = nonempty_lines.map(lambda x: tuple(x.split(' ', 1)))
words.saveAsSequenceFile("problem86")
Step 5 : Check the contents of directory problem86
hdfs dfs -cat problem86/part*
Step 6 : Create key-value pairs (where the key is null)
nonempty_lines.map(lambda line: (None, line)).saveAsSequenceFile("problem86_1")
Step 7 : Read back the sequence file data using Spark.
seqRDD = sc.sequenceFile("problem86_1")
Step 8 : Print the content to validate it.
for line in seqRDD.collect():
    print(line)
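The filter-and-split logic can be checked without a cluster. The sketch below is plain Python, not PySpark, and the blank line in the sample text is an assumption added to exercise the filter. It also shows that tuple(x.split(' ', 1)) yields (first word, rest of line), while the problem wording asks for the entire line as the value; the helper to_pair_full is a hypothetical variant for that reading.

```python
# Sample content from the problem, plus one blank line (an assumption) to exercise the filter.
content = """Hello this is ABCTECH.com
This is XYZTECH.com

Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce
"""

def to_pair_rest(line):
    """Same expression as the solution: (first word, remainder of the line)."""
    return tuple(line.split(" ", 1))

def to_pair_full(line):
    """Hypothetical variant matching the problem wording: (first word, entire line)."""
    return (line.split(" ", 1)[0], line)

# Equivalent of contentRDD.filter(lambda x: len(x) > 0)
nonempty_lines = [line for line in content.splitlines() if len(line) > 0]

pairs_rest = [to_pair_rest(line) for line in nonempty_lines]
pairs_full = [to_pair_full(line) for line in nonempty_lines]

# Part 2: null key, entire line as value
null_key_pairs = [(None, line) for line in nonempty_lines]

for pair in pairs_rest:
    print(pair)
```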
CORRECT TEXT
Problem Scenario 75 : You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.orders
table=retail_db.order_items
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following activities.
1. Copy the "retail_db.order_items" table to HDFS in the directory p90_order_items.
2. Compute the sum of the entire revenue in this table using pyspark.
3. Find the maximum and minimum revenue as well.
4. Calculate the average revenue.
Columns of the order_items table : (order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price)
Correct answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Import a single table.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p90_order_items -m 1
Note : Make sure there is no space before or after the '=' sign. Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.
Step 2 : Read the data from one of the partitions created by the above command.
hadoop fs -cat p90_order_items/part-m-00000
Step 3 : In pyspark, get the total revenue across all days and orders.
entireTableRDD = sc.textFile("p90_order_items")
# Cast string to float
extractedRevenueColumn = entireTableRDD.map(lambda line: float(line.split(",")[4]))
Step 4 : Verify the extracted data
for revenue in extractedRevenueColumn.collect():
    print(revenue)
# Use the reduce function to sum the values of a single column
totalRevenue = extractedRevenueColumn.reduce(lambda a, b: a + b)
Step 5 : Calculate the maximum revenue
maximumRevenue = extractedRevenueColumn.reduce(lambda a, b: (a if a>=b else b))
Step 6 : Calculate the minimum revenue
minimumRevenue = extractedRevenueColumn.reduce(lambda a, b: (a if a<=b else b))
Step 7 : Calculate average revenue
count=extractedRevenueColumn.count()
averageRev=totalRevenue/count
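The same reduce lambdas can be tried locally with functools.reduce. This is a plain-Python sketch rather than PySpark, and the four sample rows are hypothetical values chosen only to follow the order_items column layout (subtotal at index 4).

```python
from functools import reduce

# Hypothetical rows in the order_items layout:
# order_item_id, order_item_order_id, order_item_product_id,
# order_item_quantity, order_item_subtotal, order_item_product_price
rows = [
    "1,1,957,1,299.5,299.5",
    "2,2,1073,1,199.5,199.5",
    "3,2,502,5,250.0,50.0",
    "4,2,403,1,129.0,129.0",
]

# Cast the subtotal column (index 4) from string to float
revenues = [float(line.split(",")[4]) for line in rows]

# Same lambdas as in the RDD solution, applied with functools.reduce
total_revenue = reduce(lambda a, b: a + b, revenues)
maximum_revenue = reduce(lambda a, b: a if a >= b else b, revenues)
minimum_revenue = reduce(lambda a, b: a if a <= b else b, revenues)
average_revenue = total_revenue / len(revenues)

print(total_revenue, maximum_revenue, minimum_revenue, average_revenue)
```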
CORRECT TEXT
Problem Scenario 37 : ABCTECH.com has done a survey on feedback for their exam products using a web-based form, with the following free-text fields as input in the web UI.
Name: String
Subscription Date: String
Rating : String
And the survey data has been saved in a file called spark9/feedback.txt
Christopher|Jan 11, 2015|5
Kapil|11 Jan, 2015|5
Thomas|6/17/2014|5
John|22-08-2013|5
Mithun|2013|5
Jitendra||5
Write a Spark program using regular expressions which will separate the records with valid dates from the rest and save them in two separate files (good records and bad records).
Correct answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create the file first using Hue in HDFS.
Step 2 : Write regular expressions for every valid date syntax, to check whether records have valid dates or not.
val reg1 = """(\d+)\s(\w{3})(,)\s(\d{4})""".r // 11 Jan, 2015
val reg2 = """(\d+)(\/)(\d+)(\/)(\d{4})""".r // 6/17/2014
val reg3 = """(\d+)(-)(\d+)(-)(\d{4})""".r // 22-08-2013
val reg4 = """(\w{3})\s(\d+)(,)\s(\d{4})""".r // Jan 11, 2015
Step 3 : Load the file as an RDD.
val feedbackRDD = sc.textFile("spark9/feedback.txt")
Step 4 : As the data is pipe separated, split it accordingly.
val feedbackSplit = feedbackRDD.map(line => line.split('|'))
Step 5 : Now get the valid records as well as the bad records.
val validRecords = feedbackSplit.filter(x => (reg1.pattern.matcher(x(1).trim).matches | reg2.pattern.matcher(x(1).trim).matches | reg3.pattern.matcher(x(1).trim).matches | reg4.pattern.matcher(x(1).trim).matches))
val badRecords = feedbackSplit.filter(x => !(reg1.pattern.matcher(x(1).trim).matches | reg2.pattern.matcher(x(1).trim).matches | reg3.pattern.matcher(x(1).trim).matches | reg4.pattern.matcher(x(1).trim).matches))
Step 6 : Now convert each Array to a tuple of Strings
val valid = validRecords.map(e => (e(0),e(1),e(2)))
val bad = badRecords.map(e => (e(0),e(1),e(2)))
Step 7 : Save the output as text files; the output must be written to a single file.
valid.repartition(1).saveAsTextFile("spark9/good.txt")
bad.repartition(1).saveAsTextFile("spark9/bad.txt")
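To see which sample records each pattern accepts, the four expressions can be run locally with Python's re module. This is a plain-Python sketch, not Spark code; re.fullmatch is used here because it requires the whole string to match, as Java's Matcher.matches does.

```python
import re

# The four date patterns from Step 2
DATE_PATTERNS = [
    re.compile(r"(\d+)\s(\w{3})(,)\s(\d{4})"),  # 11 Jan, 2015
    re.compile(r"(\d+)(/)(\d+)(/)(\d{4})"),     # 6/17/2014
    re.compile(r"(\d+)(-)(\d+)(-)(\d{4})"),     # 22-08-2013
    re.compile(r"(\w{3})\s(\d+)(,)\s(\d{4})"),  # Jan 11, 2015
]

# Sample data from spark9/feedback.txt
records = [
    "Christopher|Jan 11, 2015|5",
    "Kapil|11 Jan, 2015|5",
    "Thomas|6/17/2014|5",
    "John|22-08-2013|5",
    "Mithun|2013|5",
    "Jitendra||5",
]

def has_valid_date(fields):
    """True when the date field matches one of the patterns in full."""
    date = fields[1].strip()
    return any(pattern.fullmatch(date) for pattern in DATE_PATTERNS)

split_records = [line.split("|") for line in records]
good = [tuple(f) for f in split_records if has_valid_date(f)]
bad = [tuple(f) for f in split_records if not has_valid_date(f)]

print("good:", good)
print("bad:", bad)
```

Note that these patterns only check the shape of the date, so a value such as 99/99/2014 would still count as a good record.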
CORRECT TEXT
Problem Scenario 84 : In continuation of the previous question, please accomplish the following activities.
1. Select all the products which have a null product code.
2. Select all the products whose name starts with Pen, with results ordered by price in descending order.
3. Select all the products whose name starts with Pen, with results ordered by price in descending order and quantity in ascending order.
4. Select the top 2 products by price.
Correct answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Select all the products which have a null product code
val results = sqlContext.sql("""SELECT * FROM products WHERE code IS NULL""")
results.show()
// Note : a comparison with = NULL never evaluates to true, so the query below returns no rows; use IS NULL instead.
val results = sqlContext.sql("""SELECT * FROM products WHERE code = NULL""")
results.show()
Step 2 : Select all the products whose name starts with Pen, with results ordered by price in descending order.
val results = sqlContext.sql("""SELECT * FROM products WHERE name LIKE 'Pen %' ORDER BY price DESC""")
results.show()
Step 3 : Select all the products whose name starts with Pen, with results ordered by price in descending order and quantity in ascending order.
val results = sqlContext.sql("""SELECT * FROM products WHERE name LIKE 'Pen %' ORDER BY price DESC, quantity""")
results.show()
Step 4 : Select the top 2 products by price
val results = sqlContext.sql("""SELECT * FROM products ORDER BY price desc LIMIT 2""")
results.show()
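The query semantics can be tried without Spark using Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for Spark SQL, and the products table layout (code, name, quantity, price) and all rows are assumptions, since the previous question's data is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed table layout and hypothetical rows
conn.execute("CREATE TABLE products (code TEXT, name TEXT, quantity INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        ("PEN", "Pen Red", 5000, 1.23),
        ("PEN", "Pen Blue", 8000, 1.25),
        ("PEN", "Pen Black", 2000, 1.25),
        ("PEC", "Pencil 2B", 10000, 0.48),
        (None, "Pencil HB", 0, 9999.99),
    ],
)

def names(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

# IS NULL finds the row; = NULL compares to unknown and matches nothing.
null_code = names("SELECT name FROM products WHERE code IS NULL")
equals_null = names("SELECT name FROM products WHERE code = NULL")

# 'Pen %' requires a space after Pen, so 'Pencil 2B' is excluded.
pens_sorted = names(
    "SELECT name FROM products WHERE name LIKE 'Pen %' "
    "ORDER BY price DESC, quantity ASC"
)
top_two = names("SELECT name FROM products ORDER BY price DESC LIMIT 2")

print(null_code, equals_null, pens_sorted, top_two)
```

The two pens priced at 1.25 show why the secondary sort key matters: without quantity in the ORDER BY, their relative order is not guaranteed.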
CORRECT TEXT
Problem Scenario 60 : You have been given the code snippet below.
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)),
(6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)),
(6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)),
(3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
Correct answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
b.join(d).collect
join [Pair] : Performs an inner join using two key-value RDDs. Please note that the keys must be generally comparable to make this work.
keyBy : Constructs two-component tuples (key-value pairs) by applying a function on each data item. The result of the function becomes the key and the original data item becomes the value of the newly created tuples.
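A plain-Python sketch of what keyBy and an inner join compute for this data. The helpers key_by and inner_join are hypothetical stand-ins, not Spark APIs, and the order of the result may differ from the order Spark returns.

```python
from collections import defaultdict

def key_by(items, func):
    """Pair each item with func(item) as its key, like RDD.keyBy."""
    return [(func(item), item) for item in items]

def inner_join(left, right):
    """Emit (key, (left_value, right_value)) for every pair of values sharing a key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]

a = ["dog", "salmon", "salmon", "rat", "elephant"]
c = ["dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee"]

b = key_by(a, len)
d = key_by(c, len)
joined = inner_join(b, d)

for row in joined:
    print(row)
```

Keys present on only one side (8 for elephant, 4 for wolf and bear) drop out, which is why the expected output contains only keys 3 and 6.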
CORRECT TEXT
Problem Scenario 45 : You have been given 2 files, with the content given below
(spark12/technology.txt)
(spark12/salary.txt)
(spark12/technology.txt)
first,last,technology
Amit,Jain,java
Lokesh,kumar,unix
Mithun,kale,spark
Rajni,vekat,hadoop
Rahul,Yadav,scala
(spark12/salary.txt)
first,last,salary
Amit,Jain,100000
Lokesh,kumar,95000
Mithun,kale,150000
Rajni,vekat,154000
Rahul,Yadav,120000
Write a Spark program which will join the data based on first and last name and save the joined results in the following format: first,last,technology,salary
Correct answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create the 2 files first using Hue in HDFS.
Step 2 : Load both files as RDDs
val technology = sc.textFile("spark12/technology.txt").map(e => e.split(","))
val salary = sc.textFile("spark12/salary.txt").map(e => e.split(","))
Step 3 : Now create key-value pairs of the data and join them.
val joined = technology.map(e=>((e(0),e(1)),e(2))).join(salary.map(e=>((e(0),e(1)),e(2))))
Step 4 : Save the results in a text file as below.
joined.repartition(1).saveAsTextFile("spark12/multiColumnJoined.txt")
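A plain-Python sketch of the composite-key join, with an extra formatting step that renders each joined record as first,last,technology,salary. The helper to_keyed is hypothetical, and the sketch assumes each (first, last) pair appears at most once per file. As in the Scala solution, the header lines are not skipped, so they join with each other on the key (first, last).

```python
technology_txt = """first,last,technology
Amit,Jain,java
Lokesh,kumar,unix
Mithun,kale,spark
Rajni,vekat,hadoop
Rahul,Yadav,scala"""

salary_txt = """first,last,salary
Amit,Jain,100000
Lokesh,kumar,95000
Mithun,kale,150000
Rajni,vekat,154000
Rahul,Yadav,120000"""

def to_keyed(text):
    """Map each CSV line to a (first, last) key and its third column."""
    keyed = {}
    for line in text.splitlines():
        first, last, value = line.split(",")
        keyed[(first, last)] = value
    return keyed

technology = to_keyed(technology_txt)
salary = to_keyed(salary_txt)

# Inner join on the composite key, then flatten to one comma-separated line
joined_lines = [
    ",".join([first, last, tech, salary[(first, last)]])
    for (first, last), tech in technology.items()
    if (first, last) in salary
]

for line in joined_lines:
    print(line)
```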
CORRECT TEXT
Problem Scenario 72 : You have been given a table named "employee2" with following detail.
first_name string
last_name string
Write a Spark script in Python which reads this table and prints all the rows and individual column values.
Correct answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Import statements for HiveContext
from pyspark.sql import HiveContext
Step 2 : Create sqlContext
sqlContext = HiveContext(sc)
Step 3 : Query Hive
employee2 = sqlContext.sql("select * from employee2")
Step 4 : Now print the data
for row in employee2.collect():
    print(row)
Step 5 : Print a specific column
for row in employee2.collect():
    print(row.first_name)
CORRECT TEXT
Problem Scenario 8 : You have been given the following MySQL database details as well as other info.
Please accomplish following.
1. Import the joined result of the orders and order_items tables, joined on orders.order_id = order_items.order_item_order_id.
2. Also make sure each table's file is partitioned into 2 files, e.g. part-00000, part-00002
3. Also make sure you use the order_id column for sqoop to use for boundary conditions.
Correct answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Clean the HDFS file system; if these directories exist, remove them.
hadoop fs -rm -R departments
hadoop fs -rm -R categories
hadoop fs -rm -R products
hadoop fs -rm -R orders
hadoop fs -rm -R order_items
hadoop fs -rm -R customers
Step 2 : Now import the joined result as per the requirement.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--query="select * from orders join order_items on orders.order_id = order_items.order_item_order_id where \$CONDITIONS" \
--target-dir /user/cloudera/order_join \
--split-by order_id \
--num-mappers 2
Step 3 : Check the imported data.
hdfs dfs -ls order_join
hdfs dfs -cat order_join/part-m-00000
hdfs dfs -cat order_join/part-m-00001
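The role of the $CONDITIONS placeholder together with --split-by and --num-mappers can be illustrated with a simplified Python sketch. This is an approximation for intuition only, not Sqoop's actual splitting code: it assumes an integer split column, and the minimum and maximum order_id values (1 and 10) are hypothetical. Each mapper runs the query with the placeholder replaced by its own range predicate and writes its own part file.

```python
def split_conditions(column, min_value, max_value, num_mappers):
    """Divide [min_value, max_value] into contiguous half-open ranges, one per mapper."""
    span = max_value - min_value + 1
    bounds = [min_value + (span * i) // num_mappers for i in range(num_mappers + 1)]
    return [
        f"{column} >= {bounds[i]} AND {column} < {bounds[i + 1]}"
        for i in range(num_mappers)
    ]

query = (
    "select * from orders join order_items "
    "on orders.order_id = order_items.order_item_order_id where $CONDITIONS"
)

# Hypothetical boundary values for order_id
conditions = split_conditions("order_id", 1, 10, 2)
mapper_queries = [query.replace("$CONDITIONS", cond) for cond in conditions]

for i, sql in enumerate(mapper_queries):
    print(f"part-m-{i:05d}: {sql}")
```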
CORRECT TEXT
Problem Scenario 16 : You have been given the following MySQL database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the assignment below.
1. Create a table in Hive as below.
create table departments_hive(department_id int, department_name string);
2. Now import data from the MySQL table departments into this Hive table. Please make sure that the data is visible using the Hive command below.
select * from departments_hive
Correct answer:
See the explanation for Step by Step Solution and configuration.
Explanation:
Solution :
Step 1 : Create the Hive table as specified.
hive
show tables;
create table departments_hive(department_id int, department_name string);
Step 2 : The important point here is that when we create a table without specifying field delimiters, the default delimiter for Hive is ^A (\001). Hence, while importing the data, we have to provide the proper delimiter.
sqoop import \
--connect jdbc:mysql://quickstart:3306/retail_db \
--username=retail_dba \
--password=cloudera \
--table departments \
--hive-home /user/hive/warehouse \
--hive-import \
--hive-overwrite \
--hive-table departments_hive \
--fields-terminated-by '\001'
Step 3 : Check the data in the directory.
hdfs dfs -ls /user/hive/warehouse/departments_hive
hdfs dfs -cat /user/hive/warehouse/departments_hive/part*
Check the data in the Hive table.
select * from departments_hive;
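Why the delimiter matters can be seen in a short plain-Python sketch. The two department rows are hypothetical sample values. A line written with the ^A (\001) separator splits cleanly on that character, while a reader expecting commas sees the whole line as a single field.

```python
# Ctrl-A, written as '\001' in the sqoop --fields-terminated-by option
FIELD_DELIMITER = "\x01"

# Hypothetical rows from the departments table
rows = [(2, "Fitness"), (3, "Footwear")]

# What the import writes: fields joined by the ^A character
encoded = [FIELD_DELIMITER.join(str(value) for value in row) for row in rows]

# Reading with the matching delimiter recovers both columns
decoded = [tuple(line.split(FIELD_DELIMITER)) for line in encoded]

# Reading with a mismatched delimiter leaves each line as a single field
comma_split = [line.split(",") for line in encoded]

print(decoded)
print(comma_split)
```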