Crowd density estimation using deep learning for Hajj pilgrimage video analytics

Background: This paper focuses on advances in crowd control research, with an emphasis on high-density crowds, particularly Hajj crowds. Video analysis and visual surveillance have become increasingly important for enhancing the safety and security of the pilgrimage in Makkah, Saudi Arabia. Hajj is a particularly distinctive event, with hundreds of thousands of people gathering in a small space, which makes precise analysis of video footage difficult even with advanced video and computer vision algorithms. This research proposes an algorithm based on a Convolutional Neural Network (CNN) model specifically for Hajj applications, together with a system for counting people and estimating crowd density. Methods: The model adopts an architecture that detects each person in the crowd, localizes the head with a bounding box, and performs counting on our own novel dataset (HAJJ-Crowd). Results: Our algorithm outperforms state-of-the-art methods, attaining a remarkable Mean Absolute Error of 200 (an average improvement of 82.0) and Mean Square Error of 240 (an average improvement of 135.54). Conclusions: On our new HAJJ-Crowd dataset, we provide density maps and prediction results for several standard methods for evaluation and testing.


Introduction
The Hajj is an occasion for a set of prescribed rituals. It is linked to the life of the Islamic prophet Muhammad, who lived in the seventh century AD, although Muslims believe that the tradition of pilgrimage to Mecca dates all the way back to Abraham's time. 1 For four to five days a year, over two million pilgrims from many parts of the world come to Mecca, where they visit its holy sites and perform rituals. 2 Each ritual involves a short but challenging route. The Hajj authorities have confirmed that they have difficulty monitoring crowd density, as illustrated by the tragedy that occurred in September 2015. 3 Regression-based approaches are normally used to estimate crowd density by inferring a mapping between low-level features and crowd counts. 1,2 In this paper, we propose a method for crowd analysis and density estimation using deep learning. The benefit of the Convolutional Neural Network (CNN) model is that it is superior to handcrafted features in identifying crowd-specific characteristics. We therefore propose a CNN-based framework for crowd counting in this study. 2 Our aim is to analyze density maps of crowd videos and then use visualization for cross-scene crowd analysis in unseen target scenes. To do this, we must overcome the following obstacle: existing crowd analysis work is insufficient for comparing research on scene analysis. [4][5][6] The main contributions of this research include: 1. A methodology to accurately perform crowd analysis at arbitrary crowd densities and from arbitrary perspectives in a given video.
2. An evaluation of interventions and a comparison of these established methods against recent deep CNN networks. 3. A new dataset based on the Hajj pilgrimage, specifically covering the crowds around the Kaaba area. Crowd datasets such as Shanghai Tech, UCSD, and UCF CC 50 are available for crowd analysis research; however, our dataset contains much larger crowds.

Related works
Early works on the usage of detection methods in crowd counting are presented. 7-11 Typically, these approaches rely on an individual or head detector applied through a sliding window over the image. Recently, many exceptional object detectors have been presented, including Region-Based Convolutional Neural Networks (R-CNN), 12-14 YOLO 15 and SSD, 16 but these can have low detection precision in cluttered scenes. Some works, such as Idrees et al. 17 and Chan et al. 18 , implement regression-based approaches that learn directly from crowd images in order to minimize these issues. They normally first extract global 19 (texture, gradient, edge) or local characteristics 20 (SIFT, 21 LBP, 22 HOG, 23 and GLCM 21 ). Then regression techniques such as linear regression 24 and Gaussian mixture regression 25 are employed to learn the mapping to the crowd count. These approaches manage the problems of occlusion and background clutter successfully, but spatial detail is still ignored. Thus, Lempitsky et al. 26 developed a framework that focuses on density estimation, learning a linear mapping between local features and density maps. To reduce the difficulty of learning a linear mapping, a non-linear mapping based on random forest regression, in which the same data are used to train two separate forests, has been proposed. 27 Models that use CNNs to estimate crowd density 28-31 have improved significantly compared to conventional handcrafted methods. Considering the drawbacks of these conventional methods, we have employed an improved CNN.

REVISED Amendments from Version 1
We are happy to submit a revised version of our work, titled "Crowd density estimation using deep learning for Hajj pilgrimage video analytics," which incorporates the reviewers' suggestions. The changes made from version 1 to version 2, and the reasons for them, are as follows. In the Result Analysis section, we have described the training, testing and validation in detail in response to the first reviewer's comment, and we have mentioned cross-fold validation and the tuning of hyperparameters. In the Methods section we have added new images (Figure 2). Any further responses from the reviewers can be found at the end of the article.

Methods
We propose a model that employs state-of-the-art crowd counting algorithms for the Hajj pilgrimage. The algorithms predict specific regions around people's heads in Hajj crowd images. The head size of each individual is identified using a multi-stage procedure. Figure 1 shows the proposed architecture, which is made up of three key components.

Architecture of CNN layer
Like existing CNN-based detectors, our model is built on a deep backbone feature-extraction network, and detection accuracy is likely linked to the consistency of these features. CNN-based networks are often used for crowd counting and give approximately real-time performance. 31 The backbone network starts from the first five CNN convolution blocks, initialized using ImageNet training. 33 Typically, a CNN design consists of a single input layer, several convolutional and pooling layers, a number of fully connected layers, and a final output layer, automating the feature extraction process. As input, an RGB crowd image of 224 × 224 pixels is accepted, with the data downsampled by max pooling in each block. Except for the last blocks, which are cloned by the following blocks, every block in the network branches. Resolutions of 0.5, 0.25, 0.125, and 0.166 are used to generate feature maps from the cloned blocks. Figure 2 shows the architecture of the CNN layers in our experiment.
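As a rough illustration of the downsampling just described, the sketch below (our own, not the authors' code) computes the feature-map sizes a 224 × 224 input would yield at the listed scale factors; the real sizes also depend on each block's padding and stride, which the paper does not specify.

```python
# Illustration only (not the authors' implementation): feature-map sizes
# for a 224x224 input at the scale factors listed in the text.

INPUT_SIZE = 224
SCALES = [0.5, 0.25, 0.125, 0.166]

def feature_map_sizes(input_size=INPUT_SIZE, scales=SCALES):
    """Return the (height, width) of each scaled feature map, rounded down."""
    return [(int(input_size * s), int(input_size * s)) for s in scales]
```

For example, the 0.5-scale branch would produce a 112 × 112 map, while the 0.166-scale branch would produce a 37 × 37 map.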

Classification of the box
Instead of resizing everything to the same size, we use a per-pixel categorization approach for scaling. The model classifies each head as belonging to one of the bounding-box classes. The scale branches of the model generate a map set {D_n^s}, n = 0, …, n_B, giving the confidence level of each pixel for each box class. Training the model also requires knowing the head sizes of the people in the images, which is not easily accessible and cannot be reliably obtained from typical crowd-sourced databases; we therefore created a method to estimate head sizes in this research. We used the point annotations available in the crowd dataset to obtain the ground truth. With these annotations, people's heads are located at specific coordinates. Note that only quadratic (square) boxes are considered. Each annotation point is situated approximately at the center of a head, though this may vary drastically depending on the number of people. The same applies to scale, since the annotation points not only indicate the scale of each person in the crowd but also encode scale in their spacing. Assuming a locally homogeneous crowd density, the distance between two nearby people can represent the size of the box. In simpler words, a given head size is set equal to the distance to its closest neighbor. These boxes are appropriate for crowds of medium to large size, but for sparse crowds with distant nearest neighbors the box dimensions may be wrong. On the whole, however, they are deemed experimentally effective, providing an accurate distribution of head sizes throughout a broad range of densities. In choosing the box β(s)/b_s for each scale, a popular approach is used.
At the maximum-resolution scale (s = n_s − 1), the initial box size (b = 1) is set to one, which improves the ability to handle extremely congested densities. The standard size increments across the different scales are defined as γ = 4, 2, 1, 1. Please note that at the high-level scales (0.5 and 0.25), where coarse resolution is appropriate (as shown in Figure 1), the larger box sizes belong to the low-resolution branches (0.166 and 0.125). 33
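One possible reading of this box-size rule can be sketched as follows. This is our hypothetical interpretation: the accumulation rule and all names are assumptions, since the paper only states the initial size b = 1 at the finest scale and the increments γ = 4, 2, 1, 1.

```python
# Hypothetical sketch (not the authors' code) of the per-scale box sizes:
# the finest scale starts at box size 1 and coarser scales grow by the
# increments gamma = (4, 2, 1, 1) listed coarse-to-fine in the text.

INCREMENTS = [4, 2, 1, 1]  # coarse -> fine, as in the paper

def box_sizes(base=1, increments=INCREMENTS):
    """Assign a box size to each scale: the finest scale gets `base`,
    and each coarser scale adds the corresponding increment."""
    sizes = []
    size = base
    for inc in reversed(increments):  # walk from the finest scale upward
        sizes.append(size)
        size += inc
    return list(reversed(sizes))      # return in coarse -> fine order
```

Under these assumptions the four scales would receive box sizes 5, 3, 2 and 1, with the smallest box at the highest-resolution branch.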

Count of heads
To test the model in Figure 1, a prediction fusion procedure is used. Multi-resolution predictions are made across all branches of the pipeline. From these prediction maps, the locations of the boxes are linearly scaled back to the resolution of the input. Non-maximum suppression (NMS) is then applied to prevent multi-threshold mixing of overlapping detections.
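Since NMS is applied to the fused multi-resolution boxes, a generic textbook IoU-based NMS is sketched below. This is a standard implementation, not the authors' exact procedure, and the 0.5 threshold is our assumption.

```python
# Generic IoU-based non-maximum suppression sketch (not the authors' code).
# Boxes are (x1, y1, x2, y2); higher-scoring boxes suppress overlapping ones.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep
```

In a crowd-counting setting, the number of boxes surviving NMS is the head count for the image.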

Annotation technique
We used python 3.6.15 and opencv-python 3.4.11.43 as the annotation tool to annotate head positions in the crowds. The process involved two types of labelling: point annotations (a single coordinate marking the centre of each head) and bounding boxes (a square enclosing each head). During the annotation process, the image could be freely zoomed in and out and split into a maximum of 3 × 3 small patches, allowing annotators to mark a head at 5 sizes: 2^x (x = 0, 1, 2, 3, 4) times the original image size. In this study, we developed a technique for estimating head sizes. To get the ground truth, we utilized the available point annotations from the crowd dataset. With these annotations, the heads of individuals are positioned at specific locations. It is worth noting that only quadratic (square) boxes are considered. Each point is located approximately in the middle of the head, but this may vary significantly depending on the crowd. The same holds true for scale, which not only represents the size of each individual in the crowd but is also conveyed by the spacing of the annotation points.
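The nearest-neighbour head-size heuristic described above can be sketched as a brute-force computation over the point annotations. This is our own illustration of the heuristic, not the authors' annotation tool.

```python
# Sketch of the nearest-neighbour head-size heuristic (our illustration):
# each annotated head point gets a square box whose side length equals the
# distance to its closest neighbouring point.

from math import hypot

def head_sizes(points):
    """points: list of (x, y) head annotations (at least two).
    Returns one estimated box side length per point."""
    sizes = []
    for i, (xi, yi) in enumerate(points):
        nearest = min(
            hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i
        )
        sizes.append(nearest)
    return sizes
```

As the text notes, this works well in dense regions but overestimates head sizes for isolated people, whose nearest neighbour may be far away.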

Experimental design
Firstly, we gathered all images at a size of 1280 × 720 pixels. Then we applied a deep learning method to improve the CNN and obtain the best outcomes. Training and analysis were done using PyTorch 1.

Experimental analysis
The HAJJ-Crowd data collection consisted of three sections: testing, validation and training. Count accuracy is measured with two metrics, the Mean Absolute Error (MAE) and Mean Squared Error (MSE):

$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - y_i' \rvert, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - y_i')^2}$

Here, N is the number of test samples, y_i is the annotated count, and y_i' is the estimated count. The counts are grouped into the ranges (0), (0, 1000), (1000, 2000), (2000, 3000). Each image is assigned an attribute label according to its annotated count and image quality. On the test set, MAE and MSE are computed over the matching samples of each class for a particular attribute. For example, the luminance attribute yields average MAE and MSE figures for two categories, demonstrating the counting model's sensitivity to luminance variation. Figure 3(a) and Figure 3(c) show that there is no significant change in the pixel loss from zero to ten epochs, followed by a drop of ten pixels between ten and 20 epochs. The pixel loss then keeps decreasing between 20 and 30 epochs and on through 40-52 epochs, ending at 15.0 at 52 epochs. This gives the true training loss. The validation pixel loss is 17 at 40 epochs and 14 at 52 epochs. Using the preceding equations, we also computed the test MAE; the validation and test MAE losses are shown in Figure 3(b) and Figure 3(d). For the test MAE, the error is over 600 at epoch zero and comes down to 200.0 after 52 epochs. For the test MSE, the error is over 425 at epoch zero and comes down to 240.0 after 52 epochs. Figure 3 shows the graphical representations of these results.
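Under the usual crowd-counting convention (MSE reported as the root of the mean squared error), the two metrics can be computed as follows; this is a generic sketch of the standard definitions, not the authors' evaluation script.

```python
# Standard crowd-counting metrics: y_true are the annotated counts y_i,
# y_pred the estimated counts y_i'. MSE follows the crowd-counting
# convention of reporting the root of the mean squared error.

from math import sqrt

def mae(y_true, y_pred):
    """Mean Absolute Error over N test samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Root of the mean squared error over N test samples."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

With y_true = [100, 200] and y_pred = [110, 190], both metrics evaluate to 10.0.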

Proposed method comparison with state-of-the-art methods
The HAJJ-Crowd dataset contains very large crowds together with density annotations. It contains 1050 training images and 450 testing images, all at the same resolution of 1280 × 720 pixels. For our HAJJ-Crowd dataset, we used 80% of the data for training and 20% for testing, and we could successfully validate 90% of the data. For our experiment, we used three-fold cross-validation. On the mainstream UCF CC 50 dataset, we compare against the most advanced existing approaches 34-38 in terms of MAE and MSE. Our method and dataset outperform the state-of-the-art methods, attaining a remarkable MAE of 200.0 (an average improvement of 82.0 points) and MSE of 240.0 (an average improvement of 135.54 points). We established the range of feasible values for each hyperparameter, as well as a sampling technique, evaluation criteria, and a cross-validation procedure. We report MSE because its squared-error form makes mathematical operations easier than a non-differentiable function such as MAE. Table 1 shows the comparison with state-of-the-art methods.
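A minimal sketch of a three-fold split over the 1050 training images mentioned above is shown below. The contiguous-fold layout is our assumption; the paper does not describe how the folds were drawn.

```python
# Illustrative three-fold cross-validation split (our sketch, not the
# authors' pipeline): k contiguous folds over n_items indices, each fold
# serving once as the validation set.

def three_fold_splits(n_items, k=3):
    """Yield (train_indices, val_indices) pairs for k contiguous folds."""
    indices = list(range(n_items))
    fold_size = n_items // k
    for f in range(k):
        val = indices[f * fold_size:(f + 1) * fold_size]
        train = indices[:f * fold_size] + indices[(f + 1) * fold_size:]
        yield train, val
```

For 1050 images this yields three folds of 350 validation images each, with 700 images used for training in every fold.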

Conclusions
This paper provides a new approach to crowd density estimation using a convolutional neural network. The proposed model is a multi-column convolutional neural network with high-level feedback processing that addresses the problems posed by large crowds. The proposed model can recognize moving crowds, which leads to improved performance. We found that performing crowd analysis prior to crowd counting significantly boosts counting efficiency in extremely dense crowd scenarios. The proposed method outperforms the state-of-the-art methods, with a Mean Absolute Error of 200 and a Mean Square Error of 240.

Data availability
Underlying data
Due to the ethical and copyright limitations around social media data, the underlying data for this study cannot be disclosed. The original dataset contains a total of 1500 images, all of which were collected from the Mecca Hajj 2019. The dataset contains three classes of crowd density around the Tawaf area. The Methods section offers extensive information that will enable the research to be replicated. If you have any questions concerning the approach, please contact the corresponding author.

Open Peer Review

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
The related work section should group the works technically and highlight the drawbacks.
The classification of heads from the images is not clearly defined mathematically; explain it.
The annotation process should be explained clearly: point and bounding box - define them.
The abstract should be rewritten; "This paper aims to propose an algorithm" - this statement is not correct.
The conclusion also does not emphasize the technical solution and the findings; rewrite accordingly.

If applicable, is the statistical analysis and its interpretation appropriate? Partly

Are all the source data underlying the results available to ensure full reproducibility? Partly

Are the conclusions drawn adequately supported by the results? Partly

1. Ans: The benefit of the Convolutional Neural Network (CNN) model is that it is superior to handcrafted features in identifying crowd-specific characteristics. We propose a framework for crowd counting based on convolutional neural networks (CNNs) in this study.
2. The related work section should group the works technically and highlight the drawbacks.
Ans: Considering the drawbacks of these conventional methods, we have employed an improved CNN.
3. The classification of heads from the images is not clearly defined mathematically; explain it.
Ans: The box is situated approximately at the center of the head, though this may vary drastically depending on the number of people. The same applies to scale, since the annotation points not only indicate the scale of each person in the crowd but also encode scale in their spacing. Assuming a locally homogeneous crowd density, the distance between two nearby people can represent the size of the box. Note that only quadratic boxes are regarded as box-like. In simpler words, a given head size is set equal to the distance to its closest neighbor. These boxes are appropriate for crowds of medium to large size, but for sparse crowds with distant nearest neighbors the box dimensions may be wrong. On the whole, however, they are deemed experimentally effective, providing an accurate distribution of head sizes throughout a broad range of densities.
4. The annotation process should be explained clearly: point and bounding box - define them.
Ans: In this study, we developed a technique for estimating head sizes. To get the ground truth, we utilized the available point annotations from the crowd dataset. With these annotations, the heads of individuals are positioned at specific locations. It is worth noting that only quadratic boxes are considered box-like. Each point is located approximately in the middle of the head, but this may vary significantly depending on the crowd. The same holds true for scale, which not only represents the size of each individual in the crowd but is also conveyed by the annotation points.
5. The abstract should be rewritten; "This paper aims to propose an algorithm" - this statement is not correct.
Ans: This research proposes an algorithm based on a Convolutional Neural Networks model specifically for Hajj applications.
6. The conclusion also does not emphasize the technical solution and the findings; rewrite accordingly.
Ans: The proposed method outperforms the state-of-the-art method, with a Mean Absolute Error of 200 and a Mean Square Error of 240.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Partly

Are the conclusions drawn adequately supported by the results? Yes

1. In the Methods section, the authors wrote the following statement: "Figure 1 shows the suggested architecture of CNNs, which is made up of three key components". But actually Figure 1 shows the proposed model. Also, it does not explain the architecture of the CNN or DNN.
Ans: Typically, a CNN design consists of a single input layer, many convolutional and pooling layers, numerous fully connected layers, and a final output layer for automating the feature extraction process. As input, an RGB crowd image of 224 × 224 pixels is accepted, with data downsampling in each block by max pooling. Except for the last blocks, which are cloned by the following blocks, every block in the network branches. Resolutions of 0.5, 0.25, 0.125, and 0.166 are utilized to generate feature maps when using cloned blocks.
2. Due to ethical issues, the authors have not supplied the exact dataset. Also, the authors described in the Methods section that they collected the data from YouTube videos. Perhaps the authors could provide the video link in the references, so that the real complexity in the video could be understood.
3. Also, the authors need to clarify the necessity of using MAE and MSE for their performance analysis.
Ans: The squared-error form of MSE makes mathematical operations easier than a non-differentiable function such as MAE. That is why MAE and MSE are very important for the performance analysis in our experiment.