Cloning data with unchanged estimates of estimable non- linear functions of parameters [version 1; peer review: awaiting peer review]

Non-linear regression models occur in biology, banking, economics, and sociology, and in the study of population and biological growth. Absolute growth, human growth, and many economic variables are appropriately described by non-linear regression models. In this article, we present cloned datasets for bivariate and multivariate non-linear regression models that yield the same fitted non-linear regression as the original data. Such cloned datasets can be used to preserve the confidentiality of sensitive data for publication purposes.


Introduction
In situations where the data are confidential and cannot be shown, there is a need for an alternative or matching set of data that can play the role of the actual data. This alternative or matching set of data is called cloned data. Cloned datasets can therefore give a model-free way of representing confidential data. One possible use of such naturally cloned datasets is to preserve the confidentiality of sensitive data for publication purposes, where having datasets with the same fit as the original data is the main advantage. Anscombe (1973) provided four cloned datasets to show the significance of graphs in a statistical study. All these cloned datasets have identical summary statistics (e.g., mean, variance, and correlation) but different data graphics (scatter plots). Chatterjee and Firat (2007) presented a technique for producing different (bivariate) datasets with the same summary statistics but dissimilar graphs by applying a genetic algorithm-based method. The idea of generating cloned data with the same fit for simple and multiple linear regression has been explained by Govindaraju (2009, 2012) and Govindaraju and Haslett (2008). Govindaraju and Haslett (2008) gave the idea of cloning datasets by using simple linear regression in the bivariate case. In all cases, the regression estimates are the same, and the variability decreases from each iteration to the next. Haslett and Govindaraju (2009) explained the procedure for generating matching or cloned datasets in the multivariate case.
The procedure of Haslett and Govindaraju gives a substitute way of presenting confidential data so that a multiple regression analysis has the same fit on the original as on the cloned data; the data, however, have been altered so that they are no longer confidential. The advantage is that parameter estimates from the cloned data and the original data do not involve any model error. Haslett and Govindaraju (2012) addressed the issue of how to enhance the algorithm to produce several cloned datasets that generate the same fitted regression equations. The primary method relies on the fact that the fitted slope and intercept are merely estimates, and that somewhat dissimilar datasets can still generate the same estimates. Anscombe (1973) presented four different fictitious datasets (see Table 1) and showed that the regression estimates and their standard errors are the same despite the different graphs (presented in Figure 1), but did not elaborate on how such data were obtained. Chatterjee and Firat (2007) used the data given in Anscombe (1973) and showed that all four datasets have identical summary statistics but different graphs. Govindaraju and Haslett (2008) explained the procedure to generate cloned data for a simple linear regression model y_i = a + b x_i + e_i as follows.
1) Consider n pairs of observations (x_i, y_i) on X and Y. Obtain Ŷ by the simple regression of Y on X, and obtain X̂ by the simple inverse regression of X on Y.
2) The simple regression of Ŷ on X̂ has the same parameters.
3) Further, apply step 1 to the current pair (X̂, Ŷ) and observe that the resulting regression again has the same parameters.
Iterating steps 1 and 2 generates several fictitious or cloned datasets. Govindaraju and Haslett (2008) used the datasets given in Anscombe (1973) to generate several cloned datasets and showed that all cloned datasets give the same regression estimates at the first, second, third, fifth, and tenth iterations but different scatter plots. They also found that the cloned datasets generated from the four fictitious datasets of Anscombe (1973) preserve the means of X and Y, the correlation between X and Y, the coefficient of determination R², the adjusted R², the regression fit, and the standard error of the slope, while the variances of X and Y, the residual standard error, and the standard error of the intercept decrease as the iterations increase. This shows regression towards the mean, i.e., every successive cloned dataset is closer to the mean. Haslett and Govindaraju (2009) explained the procedure for generating cloned or matched datasets with the same fit in the multivariate case. They considered independently and identically distributed data for the multiple regression model y = Xβ + ε, where y is the vector of responses, X = (X_1, X_2, …, X_p) is the n × p design matrix, β is the unknown p × 1 vector of parameters, and ε is the n × 1 vector of errors. The OLS estimate of β is β̂ = (X′X)⁻¹X′y. They used the mean-corrected form of the response variable and of the independent variables x_1, x_2, …, x_p; because of the mean correction, the multiple regression model can be written without an intercept term. They explained the procedure in six steps and generated y_new, x_1,new, …, x_p,new, which have the same fit as the original model. The cloned dataset generated by Haslett and Govindaraju (2009) gave the same regression fit and sample means of Y, X_1 and X_2, but the variances of Y, X_1 and X_2 and the residual standard error are smaller than those of the raw data. Haslett and Govindaraju (2012) developed cloning algorithms for simple and multiple linear regression models.
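The iterated inverse-regression step above can be sketched in a few lines. The paper's computations used R; this is a minimal Python illustration, reusing the Example 1 data from later in the article:

```python
import numpy as np

# One cloning iteration for simple linear regression: regress y on x to
# get fitted values y_hat, regress x on y (inverse regression) to get
# x_hat, and replace (x, y) by (x_hat, y_hat).  The regression of y_hat
# on x_hat has the same intercept and slope as that of y on x.
def clone_step(x, y):
    b, a = np.polyfit(x, y, 1)      # direct regression: y = a + b*x
    d, c = np.polyfit(y, x, 1)      # inverse regression: x = c + d*y
    return c + d * y, a + b * x     # (x_hat, y_hat)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.98, 4.26, 5.21, 6.10, 6.80, 7.50])
xh, yh = clone_step(x, y)
print(np.polyfit(x, y, 1))          # slope, intercept of the original
print(np.polyfit(xh, yh, 1))        # same slope and intercept
```

Repeated calls to `clone_step` also shrink the variances of x and y at every iteration, which is the regression-towards-the-mean behaviour noted above.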
They fit the linear regression model of y on x (where x and y are mean-centred) on the original data and find its estimates and residuals. The residuals are added to the data y one by one to create n² data points; fitting the linear regression of the new y on the new x then yields estimates identical to those from the original dataset. The above cloning algorithm can also be used in the multivariate case; in both cases, the parameter estimates from the original and the cloned datasets are the same. They explained the following methods of generating cloned datasets:
• Cloning via supplementing data by zero-mean additions (bivariate case)
• Cloning via supplementing data by zero-mean additions (multivariate case)
• Bivariate data cloning by regressing y on x and x on y
• Cloning for multiple regression via pivots
Here, we use the approach presented in Haslett and Govindaraju (2012) to provide cloned datasets for bivariate and multivariate non-linear regression models with the same non-linear regression fit.
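The zero-mean-addition algorithm for the bivariate linear case can be sketched as follows; the data here are illustrative, and Python stands in for the R code used in the paper:

```python
import numpy as np

# Cloning by zero-mean additions (bivariate linear case): replicate the
# n observations n times and add one residual of the fit to every y in
# each block.  Since the residuals sum to zero, the n^2-point cloned
# dataset reproduces the fitted line exactly.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.9, 4.3, 5.2, 6.1, 6.9, 7.5])
b, a = np.polyfit(x, y, 1)
r = y - (a + b * x)                       # residuals; r.sum() == 0
n = x.size
x_new = np.tile(x, n)                     # 1 (kron) x
y_new = np.tile(y, n) + np.repeat(r, n)   # 1 (kron) y + r (kron) 1
print(np.polyfit(x_new, y_new, 1))        # same slope and intercept
```

The cloned dataset has the same fitted line and the same means, but the residual standard error shrinks, as reported by Haslett and Govindaraju (2012).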

Methods
We consider the non-linear regression of y on X, where both X and y are non-mean-centred, with n data points. R software was used for all analyses.

Cloning for bivariate non-linear regression
In general, the non-linear regression model is y = h(X, β) + ε, with y the response variable, X the covariate design matrix, which is often controlled by the researcher, β the model parameters characterising the relationship between X and y through the regression function h, and ε the model errors, assumed to be normally distributed with zero mean and unknown variance σ².
When the regression function h is linear in the parameters β, it leads to linear regression analysis. However, linear models are not always appropriate, so one often needs to apply a non-linear regression model where h is nonlinear in β.
As in linear regression, non-linear regression provides parameter estimates based on the least squares criterion. However, unlike linear regression, no explicit mathematical solution is available, and specific algorithms are needed to solve the minimization problem through iterative numerical approximation. Here, since this is a bivariate non-linear regression on non-mean-corrected data, X = x is a column vector. In general, provided X is of full column rank, the least squares estimate of β is denoted β̂.
Now add the residuals r = ε̂ = y − h(x, β̂) from the model fitted to the data, so that the original data are replicated in blocks n times to create an n² × 1 vector, and to each block one of the residuals is added. The first block is y + 1r₁, where 1 is a vector of ones and r₁ is the first residual. The data are now 1 ⊗ y + r ⊗ 1, and the design matrix becomes 1 ⊗ x. Noting that the model is still the same, i.e., a bivariate non-linear regression, if 1 ⊗ y + r ⊗ 1 is now regressed on 1 ⊗ x, the least squares estimate β̃ equals β̂. Thus, the non-linear regression estimates for the cloned data are unchanged because the sum of the residuals, 1ᵀr, is zero. R software was used to obtain the numerical results.
Anything with zero sum can be added: if {a_l : l = 1, 2, …, m} is added to each data point in the set {y_i : i = 1, 2, …, n}, the only condition is that Σa_l = 0. Some additions are more useful than others.
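The construction above can be sketched in a few lines. The paper's computations used R; the following Python sketch uses `scipy.optimize.curve_fit` (standing in for R's `nls`) on the exponential-curve data of Example 3, with assumed starting values:

```python
import numpy as np
from scipy.optimize import curve_fit

def clone_nonlinear(x, y, h, p0):
    """Fit y = h(x, beta) + eps, then return the fitted parameters and
    the n^2-point cloned data 1 (kron) x, 1 (kron) y + r (kron) 1."""
    beta, _ = curve_fit(h, x, y, p0=p0)
    r = y - h(x, *beta)                  # residuals of the fit
    n = x.size
    return beta, np.tile(x, n), np.tile(y, n) + np.repeat(r, n)

h = lambda x, a, b: a * np.exp(b * x)    # exponential curve
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.6, 4.5, 13.8, 40.2, 125.0, 363.0])
beta, x_c, y_c = clone_nonlinear(x, y, h, p0=(0.5, 1.0))
beta_c, _ = curve_fit(h, x_c, y_c, p0=beta)
print(beta)     # original fit
print(beta_c)   # cloned fit, essentially unchanged
```

For models without an additive constant, the residual sum is only approximately zero at the least squares solution, so the cloned estimates agree very closely rather than exactly.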
Example 1: The following cloned dataset (Table 2) is generated from the dataset X = (1, 2, 3, 4, 5, 6)ᵀ and Y = (2.98, 4.26, 5.21, 6.10, 6.80, 7.50)ᵀ for the non-linear regression model Y = aX^b, a geometric or power curve, which is suitable for data from many fields whenever the plotted data show the form of the model Y = aX^b. The parameter estimates for this cloned dataset are summarized in Table 2b. It can be observed that the estimates obtained by the cloning procedure in Table 2b are the same as the actual estimates.
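The cloning of Example 1 can be reproduced as follows; a minimal Python sketch (the paper used R), with assumed starting values. Because the power curve has no additive constant, the residual sum is only approximately zero, so the cloned estimates agree closely rather than exactly:

```python
import numpy as np
from scipy.optimize import curve_fit

# Power curve Y = a*X**b fitted to the Example 1 data, then cloned by
# block-replicating the data and adding one residual per block.
h = lambda x, a, b: a * x**b
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.98, 4.26, 5.21, 6.10, 6.80, 7.50])
beta, _ = curve_fit(h, x, y, p0=(3.0, 0.5))
r = y - h(x, *beta)
x_c = np.tile(x, x.size)
y_c = np.tile(y, x.size) + np.repeat(r, x.size)
beta_c, _ = curve_fit(h, x_c, y_c, p0=beta)
print(beta, beta_c)   # cloned estimates close to the originals
```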
Example 2: The cloned dataset (Table 3) is generated from the dataset X = (0, 1, 2, 3, 4, 5, 6, 7, 8)ᵀ and Y = (0.75, 1.20, 1.75, 2.50, 3.45, 4.70, 6.20, 8.25, 11.50)ᵀ for the non-linear regression model Y = ab^X, an exponential curve. If the sensitive observed data follow an exponential curve (Y = ab^X), this procedure can be used for cloning the data. It can be observed that the estimates obtained by the cloning procedure in Table 3b are similar to the actual estimates.
Example 3: The cloned dataset (Table 4) is generated from the dataset X = (1, 2, 3, 4, 5, 6)ᵀ and Y = (1.6, 4.5, 13.8, 40.2, 125.0, 363.0)ᵀ for the non-linear regression model Y = ae^(bX), an exponential curve. If the sensitive data show the non-linear regression shape of Y = ae^(bX), this cloning procedure is helpful. It can be observed that the estimates obtained by the cloning procedure in Table 4b are equal to the actual estimates.
Example 5: The cloned dataset (Table 6) is generated for the non-linear regression model of the Makeham curve. If the observed sensitive data show the non-linear regression shape of a Makeham curve, this cloning procedure is beneficial, as the estimates are close. It can be observed that the estimates obtained by the cloning procedure in Table 6b are close to the actual estimates.
Example 6: The cloned dataset (Table 7) is generated from the dataset X = (0, 1, 2, 3, 4, 5, 6, 7, 8)ᵀ and Y = (0.75, 1.20, 1.75, 2.50, 3.45, 4.70, 6.20, 8.25, 11.50)ᵀ for the non-linear regression model Y = k + ab^X, a modified exponential curve. For sensitive data showing the pattern of a modified exponential curve, the procedure explained above, with the corresponding table and estimates, is beneficial. It can be observed that the estimates obtained by the cloning procedure in Table 7b are equal to the actual estimates.
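A sketch of Example 6 in Python (the paper used R; starting values are assumptions). Because the modified exponential curve contains the additive constant k, the residuals sum to zero at the least squares solution, so the cloned estimates are reproduced essentially exactly:

```python
import numpy as np
from scipy.optimize import curve_fit

# Modified exponential curve Y = k + a*b**X fitted to the Example 6
# data; the additive parameter k forces the residual sum to zero, so
# the n^2-point cloned dataset gives back the same estimates.
h = lambda x, k, a, b: k + a * b**x
x = np.arange(9, dtype=float)
y = np.array([0.75, 1.20, 1.75, 2.50, 3.45, 4.70, 6.20, 8.25, 11.50])
beta, _ = curve_fit(h, x, y, p0=(0.0, 1.0, 1.4))
r = y - h(x, *beta)
x_c = np.tile(x, x.size)
y_c = np.tile(y, x.size) + np.repeat(r, x.size)
beta_c, _ = curve_fit(h, x_c, y_c, p0=beta)
print(beta, beta_c)   # same fit on original and cloned data
```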
Example 7: The following cloned dataset (Table 8) is generated from the dataset X = (0, 1, 2, 3, 4, 5, 6, 7, 8)ᵀ and Y = (1225, 2879, 4994, 11525, 16190, 22573, 30677, 38517, 39003)ᵀ for the non-linear regression model Y = k/(1 + bc^X), the logistic curve. If the curve of the observed data has a logistic form, the cloning procedure of Table 8 is suitable. It can be observed that the estimates obtained by the cloning procedure in Table 8b are identical to the actual estimates.
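The logistic case of Example 7 can be sketched the same way; Python stands in for the R code used in the paper, and the starting values below are rough guesses:

```python
import numpy as np
from scipy.optimize import curve_fit

# Logistic curve Y = k / (1 + b*c**X) fitted to the Example 7 data and
# then cloned by block replication plus one residual per block.
h = lambda x, k, b, c: k / (1.0 + b * c**x)
x = np.arange(9, dtype=float)
y = np.array([1225., 2879., 4994., 11525., 16190., 22573.,
              30677., 38517., 39003.])
beta, _ = curve_fit(h, x, y, p0=(40000.0, 30.0, 0.5))
r = y - h(x, *beta)
x_c = np.tile(x, x.size)
y_c = np.tile(y, x.size) + np.repeat(r, x.size)
beta_c, _ = curve_fit(h, x_c, y_c, p0=beta)
print(beta, beta_c)   # cloned fit close to the original
```

Since the logistic model has no additive constant, the residual sum is only approximately zero here, and the agreement of the cloned estimates is close rather than exact.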

Cloning for multivariate non-linear regression
The algebra for the bivariate non-linear regression is unaltered for multivariate non-linear regression, except that the matrix X becomes the n × p matrix X = (x_1, x_2, …, x_p), and the parameter vector and its estimates become (p + 1) × 1 vectors β, β̂, and β̃.
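The multivariate case can be sketched with the same 1 ⊗ y + r ⊗ 1 construction, each covariate column being replicated n times. The model, data, and starting values below are purely illustrative assumptions (a two-covariate model with an additive constant, so the residuals sum to zero):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical two-covariate model y = k + a*exp(b*x1) + c*x2.
h = lambda X, k, a, b, c: k + a * np.exp(b * X[0]) + c * X[1]
x1 = np.array([0., 1., 2., 3., 4., 5.])
x2 = np.array([1., 0., 2., 1., 3., 2.])
y  = np.array([3.1, 2.6, 5.8, 6.4, 11.5, 15.0])
X  = np.vstack([x1, x2])                  # 2 x n design
beta, _ = curve_fit(h, X, y, p0=(1., 1., 0.5, 1.))
r = y - h(X, *beta)
n = y.size
X_c = np.tile(X, n)                       # replicate every covariate column block
y_c = np.tile(y, n) + np.repeat(r, n)     # 1 (kron) y + r (kron) 1
beta_c, _ = curve_fit(h, X_c, y_c, p0=beta)
print(beta, beta_c)   # same multivariate non-linear fit
```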

Conclusions
In this article, we presented cloned datasets for bivariate and multivariate non-linear regression models with the same non-linear regression fit. The application of such cloned datasets is to maintain the confidentiality of sensitive real data for publication purposes. In this context, new methods can be developed so that cloning is possible for further non-linear regression models. A question this study addresses is how cloning techniques compare with simulation and re-sampling. The simulation approach assumes that the model is known and then generates random data from the distribution of the response variable to illustrate the sampling variability in the estimates; re-sampling estimates the precision of sample statistics by using a subset of the available data or by drawing randomly with replacement from a set of data points. Unfortunately, these approaches do not help to explain the concept of regression or the idea of 'moving towards' the mean. The methods presented in this study are intended to fill this gap by yielding a sequence of matching datasets with the same fitted regression equation, in which the variability of the response variable Y and the explanatory variable X progressively reduces. The tendency of moving towards the means rather than the conditional mean is also demonstrated.

Data Availability
All data underlying the results are available as part of the article and no additional source data are required.