Big Data Analysis
Your solution should run correctly in the Standalone cluster mode on a single node (VM / laptop / remote machine):
Normally, one would develop a solution using local mode first and then test it in the Standalone cluster mode at the end. That said, any solution that works in one of these modes should almost always work in the other. We are not going to mark down your solution whichever of these modes you choose in the end.
Shall we add some environment variables in our Canvas VirtualBox image? It seems I can't import the pyspark package in either the IDE or the shell. (I have already changed the Java version.)
I highly recommend simply using pyspark and any text editor to develop your solution.
Nevertheless, if you really, really need to use an IDE and plain python, read on:
Please keep in mind though that your final code should not use findspark (it will be tested using spark-submit).
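For reference, the usual findspark pattern in an IDE looks roughly like this (a sketch for local development only; strip it out before submission, since the marker uses spark-submit):

# development only -- remove before submitting
import findspark
findspark.init()  # locates SPARK_HOME and puts pyspark on sys.path

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("dev").getOrCreate()
print(spark.version)
spark.stop()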
Can I use dataframe from spark? and List from python?
Dataframe yes, as already stated in the assignment.
Regarding List, it depends what you use it for. Your solution should use RDDs/DataFrames and run on a cluster, so standard python data transformations would not fit into that.
So, if I use a Python List, for example to output or collect all User IDs and their altitudes, would that not be acceptable?
If not, what if we parallelize it so that it then becomes an RDD?
In order to print the result you will have to use collect() or another similar action at the end of the data processing. Your solution should scale to GBs of data, so you should avoid using such actions at any intermediate stage of computation. You can assume that a list of all User IDs and their corresponding altitudes fits into memory, but any single trajectory is too big to fit and should be operated on using the RDD/DataFrame API only.
If you parallelize a list, it would indeed become an RDD.
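To make the boundary concrete, here is a minimal sketch (all names are made up) of the intended pattern: parallelize a small driver-side list, keep the heavy work in RDD/DataFrame transformations, and collect only the small final result:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# a small driver-side list (O(#users)) is fine to parallelize into an RDD
user_altitudes = sc.parallelize([("user_001", 120.5), ("user_002", 98.0)])

# ... distributed transformations on RDDs/DataFrames happen here ...

# collect() only once, at the very end, to print the (small) final result
for user_id, altitude in user_altitudes.collect():
    print(user_id, altitude)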
Can we use pandas to help with the assignment?
You should only use RDDs/spark dataframes API to process the data. Otherwise your solution would run out of memory on more realistic datasets (GBs).
That said, you can use whatever functions you need on small fragments of the data (O(#users), i.e., at most linear in the number of users in the dataset).
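For example, an aggregate with one row per user is O(#users) and can safely be brought to the driver; a hypothetical sketch (column names are made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# stand-in for the full (potentially GB-sized) trajectory data
df = spark.createDataFrame(
    [("user_001", 120.5), ("user_001", 130.0), ("user_002", 98.0)],
    ["user_id", "altitude"],
)

# the aggregate has one row per user, so it is O(#users) and small
per_user = df.groupBy("user_id").agg(F.max("altitude").alias("max_altitude"))

# safe to bring to the driver and post-process with pandas or plain python
small_pdf = per_user.toPandas()
print(small_pdf)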
For the questions after Q1, should we use the updated dataframe or always use the original dataframe to get our results?
Please output the results of the other questions for the updated dataframe after all the dates and times were shifted.
(When skipping Q1, please output the result for the original dataframe instead.)
For the proof of the triangle inequality, is it okay to copy this proof from somewhere as long as we reference where we got it from? Would we lose any marks by doing it this way?
You should try to write that proof in your own words and give a reference to its source.
(In general, in anything that you write, you should always avoid verbatim copying longer texts.)
Hello, does it mean that we are free to use pyspark sql functions like window functions and date and timestamp functions? I have noticed that we can use lag and window functions.
You can use pyspark sql functions if you need to (especially in Q6), but try to use just the basic API whenever possible.
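For illustration, a small sketch (schema and column names are made up) of lag over a per-user window:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical schema: one row per (user, timestamp) trajectory point
df = spark.createDataFrame(
    [("user_001", "2008-10-23 02:53:04", 39.984, 116.318),
     ("user_001", "2008-10-23 02:53:10", 39.985, 116.319)],
    ["user_id", "ts", "latitude", "longitude"],
)
df = df.withColumn("ts", F.to_timestamp("ts"))

# previous point of the same user, ordered by time
w = Window.partitionBy("user_id").orderBy("ts")
df = df.withColumn("prev_lat", F.lag("latitude").over(w)) \
       .withColumn("prev_lon", F.lag("longitude").over(w))
df.show()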
I just want to make sure that I did not misunderstand the phrase "robust to errors in the data (e.g., duplicated timepoints, adding timepoints in the middle that do not really change the trajectory)" in the requirement of Part 2, Question 7: "Argue (or prove if you like) why this measure has these properties, is suitable for this task and robust to errors in the data (e.g., duplicated timepoints, adding timepoints in the middle that do not really change the trajectory)."
Suppose we have two trajectories whose elements are (time, latitude, longitude) tuples: trajectory a is [(1,1,1),(2,1,1),(3,2,2)] and trajectory b is [(1,1,1),(3,2,2)]. I was wondering whether the dissimilarity of these two trajectories is 0 in this case.
I would also appreciate any further explanations or examples of the robustness of the measure.
Answer: Under reasonable assumptions,
[(1,1,1),(2,1,1),(3,2,2)] is different from [(1,1,1),(3,2,2)], so the measure should be non-zero.
On the other hand, [(1,1,1),(1,1,1),(3,2,2)] and [(1,1,1),(3,2,2)] are clearly the same so the measure should be zero.
You should also try to handle inconsistent data, e.g., [(1,1,1),(1,2,2)]. It is really up to you how to deal with such cases.
A more interesting case is whether
[(1,1,1),(2,2,2),(3,3,3)] is different from [(1,1,1),(3,3,3)].
If you assume that the trajectory is a line connecting points (and the movement happens at the same speed between them), then clearly these two are the same, but it is really up to you what assumption you would like to make and in how much detail you would like to study this.
There are many, many ways to define such a measure, and each comes with its own advantages and disadvantages.
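To make the interpolation idea concrete, here is a hypothetical plain-python helper (for reasoning about the definition only, not part of the Spark pipeline) that tests whether a middle point is just the linear interpolation of its neighbours:

def is_redundant(p_prev, p, p_next, eps=1e-9):
    """Return True if p lies on the straight line between its neighbours
    at its timestamp; points are (time, latitude, longitude) tuples."""
    t0, lat0, lon0 = p_prev
    t1, lat1, lon1 = p
    t2, lat2, lon2 = p_next
    if t2 == t0:
        return False  # degenerate: neighbours share a timestamp
    a = (t1 - t0) / (t2 - t0)  # interpolation fraction in [0, 1]
    return (abs(lat0 + a * (lat2 - lat0) - lat1) <= eps and
            abs(lon0 + a * (lon2 - lon0) - lon1) <= eps)

# the two cases above:
print(is_redundant((1, 1, 1), (2, 2, 2), (3, 3, 3)))  # True: constant speed along a line
print(is_redundant((1, 1, 1), (2, 1, 1), (3, 2, 2)))  # False: waiting, then moving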
I just have some quick follow-up questions.
(1) I wonder if the reason we argue "[(1,1,1),(1,1,1),(3,2,2)] and [(1,1,1),(3,2,2)] are clearly the same" is that (1,1,1) is duplicated, and after removing the duplicate the sequences are the same?
(2) For Part 2, can I understand it this way: there is no dissimilarity measure that is perfect in all cases, but there may be a best measure for specific cases?
(1) Yes. It is fine if you define your measure formally for most trajectories satisfying some reasonable assumptions (which you would need to state clearly), and then describe separately how to deal with corner cases such as inconsistencies or duplicated points (otherwise the formal definition may become a mess).
(2) There is no best dissimilarity measure really. One can probably always come up with a case that a particular measure cannot handle well but another, more complex one can; that new measure, however, may be worse in other cases instead.
How should our Report look? Should we have headings for each Task and list how we got our answers / what the answers were? What tasks need to be inside the solution document?
Just make a title (module, assignment, part), state your name and student id, and for each question you can simply have Qx, where x is the number of the question. For each question, state your solution with a suitable explanation of how it works, so that one can easily follow it.
Should we have
if __name__ == "__main__":
in our python file?
Also, can you tell us how to submit a python file in our VirtualBox?
when I try:
spark-submit /home/comp529/Desktop/COMP529_Assignment 2.py
I got: Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/home/comp529/Desktop/COMP529_Assignment
You may try ./bin/spark-submit --master "local[*]" <filename>.py . Also, before using spark-submit, you need to cd to /usr/local/spark first. (The exception above occurs because the space in the filename splits the path into two arguments; quote the full path or rename the file so that it contains no spaces.)
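For completeness, a minimal sketch of a file layout that works with spark-submit (the __main__ guard is a common convention rather than a requirement; all names are placeholders):

# assignment2.py -- hypothetical minimal layout for spark-submit
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.appName("COMP529_Assignment2").getOrCreate()
    # ... RDD/DataFrame processing goes here ...
    spark.stop()

if __name__ == "__main__":
    main()

You could then run it as, e.g., ./bin/spark-submit --master "local[*]" /home/comp529/Desktop/assignment2.py (note: no spaces in the path).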