A structured process to create datasets with nutritional information [version 1; referees: awaiting peer review]

There is a lack of datasets in Colombia that characterize the nutritional components of food items and related information. This study describes a structured process to develop datasets that capture the preferences and purchases of food items by a selected group of people. The datasets classify products according to their sodium and sugar content. The outcome of this structured process is three datasets, each with a different focus: the first contains data on food preferences, the second contains the purchase history according to the invoices obtained, and the third contains characteristics of the food items such as their brand, category, and sodium and sugar content levels, among others.

A vast amount of data is now generated and shared over the internet. Researchers can utilize thousands of photos, hours of video footage, and consumer data to create datasets 1. Some datasets are used in research with a specific goal in mind, whereas others are created to store information for future investigations. Some datasets are freely published, while others are for restricted use.
Several studies use data to analyse taste preferences in areas such as online shopping 2, music 3,4, movies 5,6 or social relations 7. However, a study of people's preferences for food items in supermarkets in Colombia faces challenges due to the lack of freely available datasets on this topic. Additionally, various products that are present in some public datasets 8 are not available in Colombia.
To address these gaps, this study describes the process of creating and describing a dataset that contains information on the food preferences and purchases of a group of people living in Colombia. An important aspect of the dataset is that it describes the sodium 9 and sugar 10 content of each food product and organizes the nutritional information available in the Colombian market.

Methods
According to the STROBE guidelines, we have taken the following into consideration.
The purpose of the study is to capture the preferences of users in self-service stores. The study was carried out in the cities of Popayán and San Juan de Pasto, Colombia, over two months, part of which was used for participant recruitment.
A group of students, professionals and independent workers, all ≥ 18 years of age, voluntarily accepted the invitation to participate in this research and signed an agreement to share their information on the condition that their identities would be protected and remain anonymous.
All data were analyzed and stored in text files available in the "Variables and Data sources" section. In that section, the structure and components are explained in more detail.
The study is exploratory and the aim is to obtain a dataset for future work. The general structure followed the principles outlined by Robert K. Yin 11. Table 1 presents a summary of these elements.

Data collection
Figure 1 illustrates the process of data acquisition, which was carried out using two methods. The first method involved collecting preferences using a survey, and the second method involved acquiring purchase records from invoices. All purchases were made in self-service stores, particularly food self-service stores. Data collection took place over a one-month period during which participants were actively involved in the process.

User preferences
User preferences were collected through a survey in which people chose products based on their preferences. For this task, the Google Forms web tool was used, with a series of questions organized into twelve sections. Participants were informed of the academic purpose of the survey, and the basic demographic data of each participant were registered. They identified their preferences out of the 708 food items presented in the survey. All items were classified into ten categories, created from observation of local self-service stores.
Table 2 shows the 10 categories under which the items were classified. The classification aimed to let participants interact with the questions in a more comfortable and conscious way and to keep the process from becoming tedious.

Table 1. Summary of the study elements.
Plan: We want to answer the question: what are the items that people prefer when making purchases in self-service stores?
Design: The preferences in the consumption of items in a supermarket are selected as the unit of analysis. Type: simple case study. Nature: exploratory.
Set up: The structure proposed in Figure 2.
Data collection: Period of data collection: July-August, 2017.
Analysis: The data are available according to the structure proposed in Table 3, Table 4, Table 5 and Figure 1, Figure 2.
Release: Placing the data in the public domain by means of this article.

The survey was available online for one month.
During the collection process, 215 people participated and shared their preferences and other demographic data.

User purchasing history
The purchase history refers to a list of products purchased by a person within a period of time in a self-service store. Sixty-five participants provided all of their purchase receipts for four weeks, in particular for food products. At the end of this period, all the invoices of the 65 people who participated in the study were collected. R-Studio v1.0.143 12 was then used to transcribe the products of interest, taking into account the number of submitted receipts, non-food products, and the number of times each user purchased each item.

Data treatment
The second part of Figure 1 illustrates how the collected information was processed to construct the datasets. In the user preferences section, the process involved manually removing irrelevant information such as repeated surveys, inconsistent data, and non-focused responses. For the participants' purchase receipts, some information was also manually removed, since some receipts contained purchases other than food products. The filtered information in both datasets was then anonymized by assigning numerical codes to the users and the products, both to protect users' identities and to classify all the products. All food items were classified based on their sodium and sugar content (based on WHO and FDA recommendations) 9,10. Figure 2 shows the final data structure after organizing the information 13.
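The coding step described above can be illustrated with a minimal Python sketch. The study states only that numerical codes were assigned to users and products; the sequential numbering scheme, the function name `assign_codes`, and the sample records below are assumptions made for illustration and are not taken from the published datasets.

```python
def assign_codes(values, start=1):
    """Map each distinct value to a sequential integer code, in first-seen order.

    Assumption: codes are simple consecutive integers; the study does not
    specify how its numerical codes were generated.
    """
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = start + len(codes)
    return codes


# Hypothetical raw records: (participant name, product name).
records = [("ana", "milk"), ("luis", "bread"), ("ana", "bread")]

user_codes = assign_codes(user for user, _ in records)
item_codes = assign_codes(item for _, item in records)

# Replace identifying names with their numerical codes.
anonymized = [(user_codes[user], item_codes[item]) for user, item in records]
print(anonymized)
```

Keeping the name-to-code mapping separate from the released files is what protects identities; only the coded records would be shared.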

Data structure
There are two columns in Table 3. The first column ("User Code") shows the code assigned to each user, and the second column lists the products selected as each user's favorites. Each user has one or more products registered in the table. In each product code, the first four digits represent the type of product, and the last three digits refer to the specific brand of that product.
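The code layout described above can be sketched in Python as follows. The sketch assumes that item codes are handled as seven-digit strings (four digits for the product type followed by three for the brand); the function name `split_item_code` and the sample code are hypothetical and are not taken from the published datasets.

```python
from typing import NamedTuple


class ItemCode(NamedTuple):
    product_type: str  # first four digits of the item code
    brand: str         # last three digits of the item code


def split_item_code(code: str) -> ItemCode:
    """Split a seven-digit item code into its product type and brand parts."""
    code = str(code).strip()
    if len(code) != 7 or not code.isdigit():
        raise ValueError(f"expected a seven-digit code, got {code!r}")
    return ItemCode(product_type=code[:4], brand=code[4:])


# Hypothetical code, for illustration only.
example = split_item_code("1203045")
print(example.product_type, example.brand)
```

Treating the codes as strings rather than integers avoids losing any leading zeros in the product type or brand part.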
Like Table 3, Table 4 presents the user code and product columns, and it has an additional column that shows the ranking: the number of times a user purchased that product divided by the number of shopping invoices for that user over the four-week period.
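As a worked illustration of the ranking defined above, the following Python sketch computes it for a single user. It is a sketch under stated assumptions, not the authors' procedure: the function name `purchase_ranking`, the representation of invoices as lists of item codes, and the sample codes are all hypothetical, and each appearance of an item on an invoice is counted as one purchase.

```python
from collections import Counter


def purchase_ranking(invoices):
    """Return {item_code: times purchased / number of invoices} for one user.

    `invoices` is a list of invoices, each a list of item codes.
    Assumption: every appearance of an item on an invoice counts as one purchase.
    """
    if not invoices:
        return {}
    counts = Counter(item for invoice in invoices for item in invoice)
    n_invoices = len(invoices)
    return {item: count / n_invoices for item, count in counts.items()}


# Hypothetical user with four invoices over the four-week period.
invoices = [
    ["1203045", "1101002"],
    ["1203045"],
    ["1203045", "1305010"],
    ["1101002"],
]
ranking = purchase_ranking(invoices)
print(ranking)
```

In this example, an item that appears on three of the four invoices receives a ranking of 3/4 = 0.75.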

Results
The numbers in the second half of Figure 3 represent a scale for measuring sodium and sugar content, based on the quantity of sodium and sugar that each product contains according to its nutritional table. There are four levels that represent the sodium content and four levels that represent the sugar content, which generates 16 possible combinations, each with its own color code. For instance, the green circle with the number 11 indicates that the sodium and sugar contents are very low, whilst the red circle with the number 44 indicates a product with very high sodium and sugar content. The pie chart illustrates the percentage of products in each sodium and sugar classification. There is a higher percentage of products with high sodium and low sugar contents.
The file describing the items contains six columns (Item_Code, Category, Section, Code_Brand, Sugar_Level, Sodium_Level). Item_Code is the code assigned to each item, and the other columns represent how the items have been classified and coded according to their characteristics.
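The two-digit classification and the percentages shown in the pie chart can be illustrated with a short Python sketch. The text does not state which digit corresponds to which nutrient, so the sketch assumes that sodium is the first digit and sugar the second; the function names and the sample level pairs are likewise hypothetical.

```python
from collections import Counter


def combined_class(sodium_level: int, sugar_level: int) -> int:
    """Combine two levels (1-4) into a two-digit class such as 11 or 44.

    Assumption: sodium is the tens digit and sugar is the units digit.
    """
    for level in (sodium_level, sugar_level):
        if level not in (1, 2, 3, 4):
            raise ValueError(f"level must be between 1 and 4, got {level!r}")
    return 10 * sodium_level + sugar_level


def class_percentages(items):
    """Return the percentage of items that fall in each combined class.

    `items` is an iterable of (sodium_level, sugar_level) pairs.
    """
    counts = Counter(combined_class(sodium, sugar) for sodium, sugar in items)
    total = sum(counts.values())
    return {cls: 100 * n / total for cls, n in counts.items()}


# Hypothetical (sodium_level, sugar_level) pairs, for illustration only.
sample = [(1, 1), (4, 4), (4, 4), (3, 1)]
print(class_percentages(sample))

# Four sodium levels times four sugar levels give 16 possible classes.
all_classes = {combined_class(s, g) for s in range(1, 5) for g in range(1, 5)}
print(len(all_classes))
```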

Conclusions
This work was carried out to construct a valid dataset of food items available in Colombia. Future academic studies can perform statistical analyses using the data collected. Using the information from the nutritional labels of food items, we classified products by their sodium and sugar content, following WHO and FDA recommendations to determine whether the products contain amounts above or below the recommended levels.

Data availability
How to cite this article: Rodriguez-Montúfar F, Ordoñez-Buitron B, Duran D, et al. A structured process to create datasets with nutritional information [version 1; referees: awaiting peer review]. F1000Research 2018, 7:110 (doi: 10.12688/f1000research.12979.1)
Copyright: © 2018 Rodriguez-Montúfar F et al. This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
The author(s) declared that no grants were involved in supporting this work.

Figure 2. Schematic of data classification. Survey_items represents the preferences of the user and Purchase_items represents the purchases themselves, along with the characteristics of each product.

Figure 3. Levels of sodium and sugars in food and drink products.

Table 5 has six columns, each representing a different product characteristic. Each product has an item code, the section to which the product belongs, the category to which the product belongs, brand, sugar content per 100 g, and sodium content per serving (classified into four levels, where 1 is the lowest and 4 is the highest).

Dataset 1: User preferences. This file contains two columns (User_Code, Item_Code). User_Code is the code assigned to each user, and Item_Code contains the encoded product that the user prefers. DOI, 10.5256/f1000research.12979.d188373 14
Dataset 2: User purchasing. This file contains three columns (User_Code, Item_Code, Rating). User_Code is the code assigned to each user, Item_Code contains the encoded product that the user prefers, and Rating is the value obtained from dividing the number of total product invoices by the number of times the user purchased a product. DOI, 10.5256/f1000research.12979.d188374 15
Brands. This file contains three columns (No, Code_Brand, Brand). Code_Brand represents the code assigned to each brand and Brand represents the brand of each product. DOI, 10.5256/f1000research.12979.d188377 18