Provide explanations using the results for each clustering of the alternative data set.

Data Analysis using R and Weka

Overview
The coursework is organized into three parts, each one focusing on a different and important aspect of either Data Analysis or Data Mining. All parts involve the use of the same dataset. The first part focuses on describing and visualizing the data and preparing the data for subsequent treatment (‘pre-processing’). The second part focuses on clustering and the third part focuses on classification. The main goal is to give you first-hand experience of working with a relatively large and real data set, from the earliest stages of data description to the later stages of knowledge extraction and prediction.

Data Set
The data set is a slightly modified version of a real-world medical data set. The data concerns the genetic classification of breast cancer tumours. Each record consists of 20 attribute (input) columns and one class (output) column, corresponding to the information about a tumour removed from a patient during surgery. The attributes are genetic markers that have been determined by assessment of the tumour, and the class variable is a provisional labelling of the sub-type of breast cancer. The entire data set consists of 1076 instances (patient cases). [Table omitted: the header row and 15 instances of the data set.] Some of the variables contain missing values, which are indicated by empty entries.

Although it is possible to make use of many different software tools in order to answer the coursework questions, you are required to use only R and Weka, as indicated in the details below.
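As a minimal sketch of how a file with empty entries for missing values might be read into R: the snippet below assumes a comma-separated file with a header row. The inline sample, the object name tumours and the file name in the comment are hypothetical stand-ins, not the coursework data.

```r
# Hypothetical stand-in for the coursework file: a header row, numeric
# attributes, a class column, and empty entries marking missing values.
csv_text <- "er,pgr,h1,class
10,20,,A
,15,3,B
7,,5,A
"

# na.strings = "" maps empty entries to NA on import. For a real file,
# replace text = csv_text with an (assumed) path such as file = "data.csv".
tumours <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(tumours))
print(missing_per_column)

# Number of instances (rows) and columns actually read.
print(dim(tumours))
```

Checking dim() against the expected size of the data set (1076 instances, 20 attributes plus the class) is a quick way to confirm the import worked before any analysis starts.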
In order to complete this coursework you will need to submit a written report describing all the analyses conducted. The maximum length for the report is 2000 words and twenty sides of A4, excluding the cover page, but including all tables and figures, minimum font size 11pt (a full page of text in a similar style to this document would contain about 500 words, so the majority of the 20 sides will be tables and figures). The report should clearly explain what you did with the data, how you did it and why you did it, and should be well structured and illustrated. Your report should contain three sections in total as described below. You should not include any lengthy code or raw output (e.g. the output of R commands) in the main body of your report, but you may include these in appendices. Note that appendices will not contribute to the word count, and are not explicitly marked: they are for reference only.

Marks
Each of the three Parts of the coursework carries a total of 25 marks, aggregating to 75 marks. Marks will only be awarded for the first 25 pages of the main body of your report.

Assessment Criteria
The main assessment criteria for the report are:
• Correctness – that is, do you apply techniques correctly; do you make correct assumptions; do you interpret the results in an appropriate manner; etc.?
• Completeness – that is, do you apply a technique only to small subsets of the data; do you apply only one technique, when there are multiple alternatives; do you consider all options; etc.?
• Originality – that is, do you combine techniques in new and interesting ways; do you make any new and/or interesting findings with the data?
• Argumentation – that is, do you explain and justify all of your choices?
 Plagiarism vs. Group Discussions 
As you should know, plagiarism is completely unacceptable and will be dealt with according to the University’s standard policies. Having said this, we do encourage students to have general discussions regarding the coursework with each other in order to promote the generation of new ideas and to enhance the learning experience. Please be very careful not to cross the boundary into plagiarism. The important part is that when you sit down to actually do the data analysis/mining and write about it, you do it individually. If you do this, and you truly understand what you have written, you will not be guilty of plagiarism. Do NOT, under any circumstances, share code or share figures, graphs or charts, etc. As examples, saying to someone, “I used a Pivot Table in Excel to do the cross tabulations” is completely fine, whereas copying and pasting the actual Pivot Table itself would be plagiarism.
Part 1 – Description, Visualisation and Pre-processing [R Only: 35 marks]
a) Explore the data [5]
i. Use as many functions/techniques in R as necessary to adequately describe and visualise the data. Provide a table for all the attributes of the dataset including the measures of centrality (mean, median etc.), dispersion and how many missing values each attribute has. Use the table to make comments about the data.
ii. Produce histograms for each attribute. Provide details of how you created the histograms and comment on the distribution of the data. Also use the descriptive statistics you produced above to help you characterise the shape of the distribution.
b) Explore the relationships between the attributes, and between the class and the attributes [5]
i. Calculate the correlations between er and pgr, b1 and b2, and p1 and p2 (three correlations). What do these tell you about the relationships between these variables?
ii. Produce scatterplots between the class variable and the er, pgr and h1 variables (note: you may have to recode the class variable as numeric to produce scatterplots). What do these tell you about the relationships between these three variables and the class?
c) General Conclusions [8]
Take into consideration all the descriptive statistics, the visualisations and the correlations you produced, together with the missing values, and comment on the importance of the attributes. Which of the attributes seem to hold significant information and which can you regard as insignificant? Provide an explanation for your choice.
d) Dealing with missing values in R [5]
i. Write a script in R to find missing values and replace them using three strategies: replace missing values with 0, with the mean and with the median.
ii. Compare and contrast these approaches.
f) Attribute transformation [6]
Explore the use of three transformation techniques (mean centering, normalisation and standardisation) to scale the attributes, and compare their various effects.
g) Attribute / instance selection [6]
i. Starting again from the raw data, consider attribute and instance deletion strategies to deal with missing values. Choose a number of missing values per instance or per attribute and delete instances/attributes accordingly. Explain your choice.
ii. Consider using correlations between attributes to reduce the number of attributes. Try to reduce the dataset to contain only uncorrelated attributes.
iii. Use principal component analysis in R to create a data set with ten attributes.
As a result, you will end up with several different sets of data to be used in Parts 2 and 3. Give each set of data a clear and distinct name, so that you can easily refer to it again in the later stages.

Part 2 – Clustering [R Only: 20 marks]
Using R (only), explore the use of clustering to find natural groupings in the data, without using the class variable – i.e. use only the 20 numeric (input) attributes to perform the clustering. Once the data is clustered, you may use the class variable to evaluate or interpret the results (how do the new clusters compare to the original classes?).
a) Use the hierarchical, k-means and PAM clustering algorithms to create classifications of seven clusters and write up the results. Which algorithm produces better results when compared to the class attribute? [10]
b) As each of these algorithms has adjustable parameters, you may explore the ‘optimisation’ or ‘tuning’ of these parameters, either manually or (preferably) automatically. Which parameters produce the best results for each clustering algorithm? Provide the reasoning behind the techniques you used to find the optimal parameters. [5]
c) Choose one of the clustering algorithms above and perform this clustering on the alternative data sets that you produced in Part 1. [5]
i. The reduced data set featuring only the first 10 Principal Components.
ii. The dataset after deletion of instances and attributes.
iii. The three datasets after you replaced missing values with the three techniques.
iv. Which of these datasets had a positive impact on the quality of the clustering? Provide explanations using the results for each clustering of the alternative data set.

Part 3 – Classification [Weka and R: 20 marks]
You must use Weka to perform the classification, but you may choose to use R to present results. Use Weka to explore the use of various classification techniques to create models that predict the given class from the input attributes. Split the data (randomly) into a training set (2/3 of the data) and a test set (containing 1/3 of the data).
a) Try using the following classification algorithms: ZeroR, OneR, NaïveBayes, IBk (k-NN) and J48 (C4.5). Which algorithm produces the best results? [10]
b) Choose one of the classification algorithms above and explore various parameter settings for each of the different splits of data. Which parameters improve the predictive ability of the algorithm? [5]
c) Choose one of the classification algorithms above and use the data sets you created in Part 1 [5]:
i. The reduced data set featuring only the first 10 Principal Components.
ii. The dataset after deletion of instances and attributes.
iii. The three datasets after you replaced missing values with the three techniques.
iv. Which of the datasets had a good impact on the predictive ability of the algorithm? Provide explanations using the results for each clustering of the alternative data set.
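The alternative data sets listed in c) i to iii could be built in R along the following lines. This is only a minimal sketch on synthetic stand-in data, not the coursework data. The object names, the helper impute(), the 50% deletion threshold and the choice to run PCA on the mean-imputed set are all assumptions made for illustration.

```r
# Synthetic stand-in data (NOT the coursework data): 30 instances,
# 12 numeric attributes, with 8 values knocked out at random.
set.seed(1)
m <- matrix(rnorm(30 * 12), nrow = 30)
m[sample(length(m), 8)] <- NA
raw <- as.data.frame(m)

# Hypothetical helper: replace the NAs in every column with fill(column).
impute <- function(df, fill) {
  df[] <- lapply(df, function(col) {
    col[is.na(col)] <- fill(col)
    col
  })
  df
}

# iii. Three imputation strategies: 0, column mean, column median.
zero_filled   <- impute(raw, function(col) 0)
mean_filled   <- impute(raw, function(col) mean(col, na.rm = TRUE))
median_filled <- impute(raw, function(col) median(col, na.rm = TRUE))

# ii. Deletion: drop attributes with more than an (assumed) 50% share of
# missing values, then drop every instance that still has a missing value.
max_missing_share <- 0.5
reduced <- raw[, colMeans(is.na(raw)) <= max_missing_share, drop = FALSE]
reduced <- reduced[complete.cases(reduced), , drop = FALSE]

# i. First 10 principal components. prcomp() cannot handle NAs, so this
# sketch assumes the mean-imputed set as its input.
pca  <- prcomp(mean_filled, center = TRUE, scale. = TRUE)
pc10 <- as.data.frame(pca$x[, 1:10])

# Keep every variant under a clear, distinct name for later stages.
variants <- list(
  zero_filled   = zero_filled,
  mean_filled   = mean_filled,
  median_filled = median_filled,
  reduced       = reduced,
  pc10          = pc10
)
print(sapply(variants, dim))
```

Holding the variants in one named list means the same clustering or export step can be repeated over names(variants), which keeps the comparison between data sets like for like.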
It is important not to exceed 2000 words but I added more pages for graphs and charts.
