COMP529/336: Big Data Analysis
Assignment 2 is worth 20% of your mark for COMP529/336. Failure on this assignment can be compensated by higher marks on other assessments on the module. The assignment aims to assess all learning outcomes for COMP529/336. Any questions about the description of the task can only be asked via the discussion board:
Ask here for clarifications regarding Assignment 2
Do not ask me such questions by email or MS Teams: it is best to have all this information in one place. Asking for help with the assignment, asking for feedback on (partial) solutions, or posting them, is strictly forbidden.
SUBMISSION
Please submit your coursework online on CANVAS before the due date, 10th Jan 2022 at 4pm (GMT). Your submission should consist of exactly 3 separate files: PART-1.doc(x)/pdf, a single Word/PDF file containing your solution to Part 1; PART-2.doc(x)/pdf, a single Word/PDF file containing your solution to Part 2; and finally your Python code as a single PART-1.py file, which should include all the instructions necessary to execute this code using PySpark. Your code will be tested on the VirtualBox image provided, which has PySpark 2.4.6 and Python 3.6.9 installed, so make sure that your code works there. You should not use ZIP or any other compression software for your submission. You can lose up to 20pt if your submission does not adhere to the expected submission format.
Standard lateness penalties will apply to any work handed in after the due time. The report must be written by yourself using your own words (see the University guidance on academic integrity for additional information). The same applies to your Python code, but your solution can be inspired by an external source as long as you clearly state that in your report and cite its source so we can see how they differ. In other words, all external sources used when working on this assignment have to be clearly acknowledged. Your text and code will be automatically checked for plagiarism and copying using Turnitin software.
Part 1. Data Analysis using PySpark [max 120 pt]
In this assignment, we are going to use PySpark to analyse the GPS trajectory dataset that was collected in the Geolife project (Microsoft Research Asia) over a period of five years by 100+ people. Each GPS trajectory in this dataset is represented by a sequence of time-stamped points, each of which contains the information of latitude, longitude and altitude. A more detailed description of the dataset and how it was collected can be found here:
We are not going to analyse the whole data set as it is too big. Please find the relevant subset, where selected individual trajectories were combined into a single CSV file and ZIPped, here:
dataset-version2.zip
The first line of this file contains a header:
UserID,Latitude,Longitude,AllZero,Altitude,Timestamp,Date,Time
which is self-explanatory (you can ignore the 0s in the AllZero column). The Timestamp is the number of days (with a fractional part) that have passed since 12/30/1899. You should process this data as an RDD or a Spark DataFrame.
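A minimal sketch of loading the file into a DataFrame (the file name and path are assumptions and may differ once you unzip the archive; the column names are those listed in the header above):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("geolife").getOrCreate()

    df = (spark.read
          .option("header", "true")        # first line is the header shown above
          .option("inferSchema", "true")   # let Spark guess numeric column types
          .csv("dataset-version2.csv"))    # assumed name of the unzipped CSV

    df.printSchema()
    df.show(5)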
To simplify matters you can interpret (longitude, latitude) as an (x,y) point in 2D space and calculate the distance between two such points using the standard Euclidean distance. However, to be accurate, you should use one of the solutions presented here to calculate this distance (pick any that works for you and make sure it calculates the distance correctly on some test example):
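Purely for illustration, one widely used accurate formula is the haversine great-circle distance. The sketch below assumes the Latitude/Longitude columns are decimal degrees; whichever solution you pick, validate it against a known pair of points first:

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two (lat, lon) points, in kilometres."""
        r = 6371.0  # mean Earth radius in km
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlam = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
        return 2 * r * math.asin(math.sqrt(a))

    # Sanity check: Beijing to Shanghai should come out at roughly 1070 km.
    print(haversine_km(39.9042, 116.4074, 31.2304, 121.4737))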
Now run Spark in the VirtualBox image that you made, in the image available on CANVAS, or simply on your own OS if you were able to install and run Spark on it.
Either way, make sure to back up your solution as often as possible. Please run Spark in the Standalone cluster mode with the number of workers set to the number of cores that your VM / laptop / remote machine has. Most of the individual tasks are independent of each other, so you can get points without solving all of them. In each task, provide the PySpark code with some explanation and the result. (No programming language or framework other than PySpark is permitted.) In case of a tie for the value of a particular measure, give preference to the user with the smaller ID value.
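A sketch of attaching PySpark to a standalone cluster started on the same machine (the master URL and core count below are assumptions; start the master and a worker with Spark's sbin/start-master.sh and sbin/start-slave.sh scripts first):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("spark://localhost:7077")   # assumed standalone master URL
             .appName("geolife-assignment")
             .config("spark.cores.max", "4")     # set to the number of cores on your machine
             .getOrCreate())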
1. Any data analysis starts with data cleaning (in fact this typically takes most of the time). In this case, we should convert all dates/times from GMT to Beijing time, where essentially all these trajectories were collected. This requires moving dates, times and timestamps 8 hours ahead. You should not create a new input file, but instead use Spark's map/withColumn transformation to change the RDD/DataFrame created from the original file. [20 pt] (If you find this too difficult, just skip to the next point and say so in your report. You will simply miss out on the points for this task as a result.)
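A minimal sketch of the DataFrame route for this task, assuming the Date and Time columns are strings formatted as yyyy-MM-dd and HH:mm:ss (an assumption about the file, so check a few rows first). The numeric Timestamp column can simply be shifted by 8/24 of a day:

    from pyspark.sql import functions as F

    shifted = (df
        .withColumn("Timestamp", F.col("Timestamp") + 8.0 / 24.0)
        .withColumn("dt", F.from_unixtime(
            F.unix_timestamp(F.concat_ws(" ", "Date", "Time"),
                             "yyyy-MM-dd HH:mm:ss") + 8 * 3600))
        .withColumn("Date", F.substring("dt", 1, 10))   # first 10 characters: shifted date
        .withColumn("Time", F.substring("dt", 12, 8))   # characters 12-19: shifted time
        .drop("dt"))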
2. Calculate, for each person, on how many days data was recorded for them (count any day with at least one data point). Output the top 5 user IDs according to this measure together with its value (as mentioned above, in case of a tie, output the user with the smaller ID). [20 pt]
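One possible shape of the answer (a sketch assuming the cleaned DataFrame from task 1 is called shifted): count distinct dates per user and sort, breaking ties by the smaller ID:

    from pyspark.sql import functions as F

    days_per_user = (shifted
        .groupBy("UserID")
        .agg(F.countDistinct("Date").alias("days_recorded"))
        .orderBy(F.desc("days_recorded"), F.asc("UserID")))

    days_per_user.show(5)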
3. Calculate, for each person, on how many days there were more than 100 data points recorded for them (count any day with at least 100 data points). Output all user IDs and the corresponding value of this measure. [20 pt]
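A sketch of one way to count such days (again assuming the cleaned DataFrame shifted): count points per (user, day), keep the days above the threshold, then count those days per user:

    from pyspark.sql import functions as F

    busy_days = (shifted
        .groupBy("UserID", "Date")
        .agg(F.count("*").alias("points"))
        .filter(F.col("points") > 100)
        .groupBy("UserID")
        .agg(F.count("*").alias("days_over_100"))
        .orderBy("UserID"))

    # Note: users with no such day will not appear here; joining with the full
    # list of user IDs would add them with a count of 0.
    busy_days.show(200, truncate=False)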
4. Calculate, for each person, the highest altitude that they reached. Output the top 5 user IDs according to this measure, its value and the day on which it was achieved (in case of a tie, output the earliest such day). [20 pt]
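A sketch using a window to pick, per user, the row with the highest altitude (earliest date on ties), then the overall top 5:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("UserID").orderBy(F.desc("Altitude"), F.asc("Date"))

    top_alt = (shifted
        .withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .select("UserID", "Altitude", "Date")
        .orderBy(F.desc("Altitude"), F.asc("UserID")))

    top_alt.show(5)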
5. Calculate, for each person, the timespan of the observation, i.e., the difference between the highest timestamp of their observations and the lowest one. Output the top 5 user IDs according to this measure and its value. [20 pt]
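A sketch: since the Timestamp column is already a fractional day count, the timespan per user is just the maximum minus the minimum:

    from pyspark.sql import functions as F

    timespan = (shifted
        .groupBy("UserID")
        .agg((F.max("Timestamp") - F.min("Timestamp")).alias("timespan_days"))
        .orderBy(F.desc("timespan_days"), F.asc("UserID")))

    timespan.show(5)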
6. Calculate, for each person, the distance travelled by them each day. For each user, output the (earliest) day on which they travelled the most. Also, output the total distance travelled by all users on all days. HINT: use lag and window functions. [20 pt]
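A sketch of the hinted lag/window approach (assuming a distance helper such as the haversine_km function sketched earlier, wrapped as a UDF): pair each point with the previous one within the same (user, day), sum the step distances per day, then keep each user's best (earliest) day:

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType
    from pyspark.sql.window import Window

    dist_udf = F.udf(haversine_km, DoubleType())   # helper defined in the distance discussion above

    w = Window.partitionBy("UserID", "Date").orderBy("Timestamp")

    steps = (shifted
        .withColumn("prev_lat", F.lag("Latitude").over(w))
        .withColumn("prev_lon", F.lag("Longitude").over(w))
        .withColumn("step_km",
            F.when(F.col("prev_lat").isNull(), F.lit(0.0))     # first point of a day travels 0
             .otherwise(dist_udf("prev_lat", "prev_lon", "Latitude", "Longitude"))))

    daily = steps.groupBy("UserID", "Date").agg(F.sum("step_km").alias("day_km"))

    # Earliest day with the maximum daily distance for each user.
    w2 = Window.partitionBy("UserID").orderBy(F.desc("day_km"), F.asc("Date"))
    best_day = (daily
        .withColumn("rn", F.row_number().over(w2))
        .filter(F.col("rn") == 1)
        .select("UserID", "Date", "day_km"))
    best_day.show(200, truncate=False)

    # Total distance travelled by all users over all days.
    print(daily.agg(F.sum("day_km")).first()[0])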
Part 2. Clustering of Trajectories [max 80 pt]
In this part we are going to work with the same data set, but not with PySpark. In fact, you should not write any code in any programming language in this part.
Suppose that we would like to figure out how many different daily travel patterns people have. To do so, we should first quantify by how much any two given trajectories differ and then use a suitable clustering algorithm.
7. Consider, for each user, the attached dataset consisting of day trajectories (a list of (timestamp, longitude, latitude) points) for individual users. Design a suitable dissimilarity measure that can quantify how different two day trajectories are from each other. Your measure should be symmetric (i.e., d(x,y) = d(y,x) for any trajectories x, y) and its value should be 0 if and only if the two trajectories are the same (i.e., d(x,y) = 0 if and only if x = y).
Argue (or prove if you like) why this measure has these properties, why it is suitable for this task, and why it is robust to errors in the data (e.g., duplicated timepoints, or timepoints added in the middle that do not really change the trajectory). [50 pt] NOTE: Make sure this measure allows for a different number of observations made each day. For half the points (i.e., [25 pt]), you can assume that the number of observations and their timestamps are the same.
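Purely as an illustration of the kind of object being asked for (not a suggested answer, and whether it actually has the required properties is exactly what you would need to argue), one family of candidates resamples both day trajectories onto a common grid of timestamps T by interpolation and then averages the pointwise distances:

    d(x, y) = (1/|T|) * sum over t in T of dist(p_x(t), p_y(t)),

where p_x(t) denotes the interpolated position of trajectory x at time t and dist is a distance between two points on the map.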
8. Is your measure a distance metric, i.e., does it satisfy the triangle inequality (d(x,y) <= d(x,z) + d(z,y) for any trajectories x, y, z)? If so, argue (or prove if you like) why; if not, show a counterexample. [10 pt]
9. For each given user, we would like to group days that have similar trajectories. Which of the clustering algorithms presented during the lectures would you choose in this case and why? [20 pt]