Project Number
ComS 17-01
Project Title
Analysis and visualisation of Merlot tasting notes
Project Leader
Fischer, B
Institution
Stellenbosch University. Division of Computer Science
Team Members
Van Zyl, P
Project Description
Sensory data such as tasting notes and wine reviews capture the essence of “good” and “bad” wines, but only in aggregate: a single review may focus on specifics of an individual wine that are not characteristic of the varietal. It is thus necessary to analyse and aggregate these textual descriptions and to correlate them with (subjective) quality metrics such as ratings. Moreover, to identify the dominant characteristics that correlate with the different rating categories, the descriptions must be aggregated along different dimensions (e.g., origin, vintage, …).
We used the ConceptCloud (http://conceptcloud.herokuapp.com/) data exploration tool to analyze the reviews of all single-varietal Merlots in Platter’s By Diners Club South African Wine Guides from 2003 to 2016. We extracted the relevant fields (primarily origin, vintage, star rating, and the free-text review) for each wine from the SQL database and used the Stanford CoreNLP natural language processing system to extract meaningful phrases from the review texts. We manually cleaned these phrases, built a formal context table from them, and used the ConceptCloud system to explore the resulting data set. In particular, we constructed and analyzed word clouds that show the distribution of vintages, origins, and taste description phrases within the different rating categories.
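The step from cleaned phrases to a formal context table, and from there to per-rating word clouds, can be illustrated with a minimal sketch. It does not use ConceptCloud's own interface; the record layout, the sample values, and the helper names `build_context` and `cloud_weights` are assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical records: the fields mirror those named in the text
# (origin, vintage, star rating, cleaned phrases); the values are invented.
wines = [
    {"id": "w1", "origin": "Stellenbosch", "vintage": 2010, "stars": 4.0,
     "phrases": ["ripe plum", "soft tannins"]},
    {"id": "w2", "origin": "Paarl", "vintage": 2012, "stars": 4.0,
     "phrases": ["ripe plum", "mint"]},
    {"id": "w3", "origin": "Stellenbosch", "vintage": 2010, "stars": 2.0,
     "phrases": ["green", "thin"]},
]


def build_context(records):
    """Return a formal context as a mapping: object id -> set of binary attributes."""
    context = {}
    for r in records:
        attrs = {
            f"origin:{r['origin']}",
            f"vintage:{r['vintage']}",
            f"stars:{r['stars']}",
        }
        attrs.update(f"phrase:{p}" for p in r["phrases"])
        context[r["id"]] = attrs
    return context


def cloud_weights(context, selector, prefix):
    """Count attributes with the given prefix among objects that have `selector`.

    The counts can serve as font weights in a word cloud for that selection.
    """
    counts = Counter()
    for attrs in context.values():
        if selector in attrs:
            counts.update(a for a in attrs if a.startswith(prefix))
    return counts


context = build_context(wines)
four_star_phrases = cloud_weights(context, "stars:4.0", "phrase:")
```

Each wine becomes one row (object) of the table and each prefixed string one column (attribute); selecting a rating attribute and counting the co-occurring phrase, origin, or vintage attributes yields the weights a word cloud for that rating category would display.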
We confirmed that ConceptCloud is stable enough to handle data sets such as the one analyzed here. We saw no obstacles to scaling this up to larger collections, e.g., the full set of wine reviews for all varietals. We identified several characteristic traits of both “good” and “bad” Merlots; these are detailed in Sections 5.3 and 5.4 of the attached technical report.
We identified several minor shortcomings in ConceptCloud’s original visualization and have already implemented some improvements. The natural language processing component, however, needs substantial improvement to extract more information (in particular, more consistent key phrases) from the reviews. We suggest integrating the phrase extraction with an ontology, taxonomy, or controlled vocabulary to achieve this. We also suggest integrating more data sources into the data set underlying the formal context table; in particular, we suggest re-integrating barrelling information (which we purposefully excluded because it interferes with the taste descriptors), because the barrelling process is under the immediate control of the wine maker. Chemical analysis data could also be integrated, where available.
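As a rough illustration of how a controlled vocabulary could make extracted key phrases more consistent, the sketch below maps variant phrases onto canonical descriptors. The vocabulary entries and the function names `normalise` and `normalise_all` are hypothetical; a real vocabulary would have to be curated from the review corpus or taken from an existing wine taxonomy.

```python
# Hypothetical controlled vocabulary: variant phrase -> canonical descriptor.
VOCABULARY = {
    "plummy": "plum",
    "ripe plum": "plum",
    "plums": "plum",
    "minty": "mint",
    "herbaceous": "herbal",
}


def normalise(phrase, vocabulary=VOCABULARY):
    """Lower-case the phrase, collapse whitespace, and map it to its canonical form.

    Phrases not covered by the vocabulary are returned in cleaned form unchanged.
    """
    key = " ".join(phrase.lower().split())
    return vocabulary.get(key, key)


def normalise_all(phrases, vocabulary=VOCABULARY):
    """Normalise a list of phrases, dropping duplicates while keeping order."""
    seen = []
    for p in phrases:
        canonical = normalise(p, vocabulary)
        if canonical not in seen:
            seen.append(canonical)
    return seen


descriptors = normalise_all(["Ripe  plum", "plummy", "minty", "soft tannins"])
```

With such a mapping in place, reviews that describe the same trait in different words contribute to the same attribute in the formal context table, which would make the resulting word clouds less fragmented.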