Research Article

Hybrid Quantum or Purely Classical? Assessing the Utility of Quantum Feature Embeddings

[version 1; peer review: 1 approved, 1 approved with reservations]
PUBLISHED 23 Aug 2024

Abstract

Background

As graph datasets (including social networks, supply chains, and bioinformatics data) grow in size and complexity, researchers are searching for ways to improve model efficiency and speed. One avenue that may provide a solution is Quantum Graph Learning (QGL), a subfield of Quantum Machine Learning (QML) that applies machine learning inspired or powered by quantum computing to graph learning tasks.

Methods

We reevaluate Quantum Feature Embeddings (QFE), a QGL methodology published by Xu et al. earlier this year. QFE uses Variational Quantum Circuits to preprocess node features before sending them to a classical Graph Neural Network (GNN), with the goal of increasing performance and/or decreasing total model size. Xu et al. evaluated this methodology by comparing its performance with that of variously sized classical models on the benchmark datasets PROTEINS and ENZYMES, and they report success.

Our core methodology and learning task remain unchanged. However, we have made several changes to the experimental design that enhance the rigor of the study: 1) we include the testing of models with no embedder; 2) we conduct a thorough hyperparameter search using a state-of-the-art optimization algorithm; and 3) we conduct stratified five-fold cross-validation, which mitigates the bias produced by our small datasets and provides multiple test statistics from which we can calculate a confidence interval.

Results

We produce classical models that perform comparably to QFE and significantly outperform the small classical models used in Xu et al.’s comparison. Notably, many of our classical models achieve this using fewer parameters than the QFE models we trained. Xu et al. do not report their total model sizes.

Conclusion

Our study casts doubt on the efficacy of QFE by demonstrating that small, well-tuned classical models can perform just as well as QFE, highlighting the importance of hyperparameter tuning and rigorous experimental design.

Keywords

Quantum Machine Learning, Quantum Graph Learning, Graph Neural Networks, Quantum Feature Embedding

1. Introduction

Over the last two decades, a new field at the intersection of quantum computing and machine learning has emerged: Quantum Machine Learning (QML). The methods in this field are best understood when broken down into three categories, as described by Houssein et al.1:

  • 1. Pure QML: algorithms that rely solely on quantum circuits, like the quantum version of the support vector machine,

  • 2. Quantum-inspired ML: classical machine learning that takes inspiration from the field of quantum computing,

  • 3. Quantum-Classical Hybrid ML: techniques that use both quantum computing and classical computing; this includes Variational Quantum Circuits (VQCs) — quantum circuits with classical parameters that can be updated during training using a classical optimizer.

Algorithms in each of these categories can be applied to a variety of tasks including classification, regression, and optimization.1

Within QML, an even newer subfield called Quantum Graph Learning (QGL) is starting to be explored. As described by Yu et al.,2 QGL has the potential to solve or mitigate several substantial problems in graph learning including the difficulty of storing and processing large graphs and the limitation of the distance across which inferences can be made. QGL should also be able to apply some of the native benefits of QML to graph learning, including a reduction in the number of required training parameters.

Yu et al.2 describe several QML techniques that can be applied to graph learning, including various types of quantum optimization, quantum random walks and quantum graph kernels (used to represent graphs in the vector space of quantum circuits), and VQCs.

1.1 Literature review

The Quantum Feature Embeddings (QFE) methodology proposed by Xu et al.3 is a special case of the VQC technique. It applies a VQC to the node features of the graph input to generate embeddings, which are then passed to a classical message-passing model (e.g. a Graph Convolutional Network4). This is depicted in Figure 1. (Note that an additional pooling layer has been included. Although this was not mentioned by Xu et al.,3 it is necessary for graph classification.)


Figure 1. The QFE Methodology created by Xu et al.3 (with an inferred pooling layer).

Xu et al. argue that using a VQC to create embeddings provides several benefits. First, because of the unitary nature of the circuit, the norms of the embeddings are preserved. This preservation helps increase the stability of the model during training. The unitary nature of the VQC also implies that it acts as a bijective function, which means that distinct inputs will not be mapped to the same output. This implies that information will not be lost during the embedding process. Finally, the exponential nature of the circuit’s vector space allows the VQC to approximate complex functions with exponentially fewer trainable parameters compared to classical models.

To test QFE empirically, Xu et al. evaluated its performance on the classification problems defined by two benchmark datasets: PROTEINS5,6 and ENZYMES.5,7 That performance was then compared with the performance of two classical embedders using Multi-Layer Perceptrons (MLPs). One had D hidden nodes (to approximate the number of parameters in QFE), where D is the number of input features; the other had 2D hidden nodes (to approximate the expressive power of QFE’s vector space). Notably, Xu et al. did not compare QFE with a model with no embedding layer.

Their results, as depicted in Figure 2, led them to conclude that QFE provides an increased accuracy compared to classical models with similar numbers of parameters and can keep up with classical models that have exponentially more parameters.


Figure 2. The results reported by Xu et al. as depicted in their article.3

Note: The State-Of-The-Art (SOTA) performance on PROTEINS is 85.7%,8 which is much higher than what QFE achieved. However, because QGL and QML in general are young fields, we do not necessarily expect QML models to achieve SOTA performance. Instead, we simply want to determine if the methodology has a positive impact and if it might be useful in the future.

1.2 Our contributions

We critically examine QFE using an improved experimental design with state-of-the-art hyperparameter tuning. This approach yields classical models that perform comparably to QFE and significantly outperform the small classical models used in Xu et al.’s comparison. In addition, we demonstrate that it is possible to obtain similar accuracies with classical models that have fewer trainable parameters.

While this paper may not explore new and exciting QML methodologies, it does provide a rigorous evaluation of an existing technique, which is scientifically valuable. Without negative and critical results like this one, it would be impossible to allocate precious research resources efficiently.

2. Methods

2.1 The embedder

The structure of the individual models we tested remains largely unchanged from the methodology created by Xu et al. During inference, node features are taken from the input graph (or batch of graphs) and passed to the embedder component (if it exists). There are five options for the embedder component:

  • 1. No embedder component,

  • 2. MLP-D, a 2-layer perceptron with D input nodes, D hidden nodes and D output nodes,

  • 3. MLP-2D, a 2-layer perceptron with D input nodes, 2D hidden nodes, and D output nodes,

  • 4. QFE-exp, which measures the expectation value of the Pauli-Z operator on each wire, and

  • 5. QFE-probs, which measures the probability of each possible output (measured in the computational basis).

where D is the number of features each node has. Two versions of the QFE embedder are included because Xu et al. did not specify their measurement method.
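
For concreteness, the two classical embedder options can be written as a small PyTorch module. The following is a minimal sketch; the ReLU activation between the two layers is our assumption, since the activation function is not specified above.

```python
# Minimal sketch of the MLP-D and MLP-2D embedder options in PyTorch.
# The ReLU activation is an assumption; layer widths follow the text above.
import torch.nn as nn

def make_mlp_embedder(d: int, hidden_multiplier: int = 1) -> nn.Module:
    """Return a 2-layer perceptron mapping D node features to D outputs."""
    hidden = hidden_multiplier * d  # D hidden nodes for MLP-D, 2D for MLP-2D
    return nn.Sequential(
        nn.Linear(d, hidden),
        nn.ReLU(),
        nn.Linear(hidden, d),
    )

mlp_d = make_mlp_embedder(d=4)                        # MLP-D
mlp_2d = make_mlp_embedder(d=4, hidden_multiplier=2)  # MLP-2D
```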

The QFE embedder, a VQC, starts with an angle embedding component using Ry gates (defined below). This component is used to convert the classical node features into a state the quantum computer can process.

$$R_y(\theta) = \begin{pmatrix} \cos(\theta/2) & -\sin(\theta/2) \\ \sin(\theta/2) & \cos(\theta/2) \end{pmatrix}$$

In the entangling component, parameterized Rx gates and controlled-NOT (CNOT) gates (defined below) are combined to create entangling layers that allow the circuit to represent complex functions. The parameters (i.e. the many instances of θ) in these layers are updated during training based on the loss calculated between the circuit’s output and the expected output.

$$R_x(\theta) = \begin{pmatrix} \cos(\theta/2) & -i\sin(\theta/2) \\ -i\sin(\theta/2) & \cos(\theta/2) \end{pmatrix}$$
$$\mathrm{CNOT} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

The number of layers used in the entangling component is a hyperparameter of our model. As in Xu et al.’s article,3 we finish with a measuring component that measures all of the wires in the quantum circuit.
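
The circuit described above can be sketched in PennyLane. The sketch below uses PennyLane’s built-in AngleEmbedding and BasicEntanglerLayers templates, which apply the RY embedding and the RX-plus-CNOT entangling layers described here; using these particular templates (and the placeholder sizes) is our implementation choice, not necessarily that of Xu et al.

```python
# A minimal PennyLane sketch of the QFE embedder: RY angle embedding,
# parameterized RX rotations entangled with CNOTs, and a measurement of
# either per-wire Pauli-Z expectations (QFE-exp) or output probabilities
# (QFE-probs).
import torch
import pennylane as qml

n_wires = 4    # D, the number of node features
n_layers = 2   # hyperparameter: number of entangling layers

dev = qml.device("default.qubit", wires=n_wires)

@qml.qnode(dev, interface="torch")
def qfe_exp(features, weights):
    qml.AngleEmbedding(features, wires=range(n_wires), rotation="Y")
    qml.BasicEntanglerLayers(weights, wires=range(n_wires), rotation=qml.RX)
    return [qml.expval(qml.PauliZ(w)) for w in range(n_wires)]   # D values

@qml.qnode(dev, interface="torch")
def qfe_probs(features, weights):
    qml.AngleEmbedding(features, wires=range(n_wires), rotation="Y")
    qml.BasicEntanglerLayers(weights, wires=range(n_wires), rotation=qml.RX)
    return qml.probs(wires=range(n_wires))                       # 2**D values

features = torch.rand(n_wires)                               # one node's features (placeholder)
weights = torch.nn.Parameter(torch.rand(n_layers, n_wires))  # trainable circuit parameters
expectations = qfe_exp(features, weights)
```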

2.2 The message-passing model

Following that, the embeddings and the edge connections from the input graph are given to the message-passing model. This model can use any type of message-passing layer (including any one of the 66 convolutional layers provided by the PyTorch Geometric Python package9). We chose to test the three layer types chosen by Xu et al.3:

  • 1. Graph Convolutional Network (GCN),4

  • 2. GraphConv,10

  • 3. Graph Attention Transformer (GAT).11

The number of hidden channels and the number of layers used by the message-passing model are hyperparameters. The number of output channels is equal to the number of classes in the classification problem.

2.3 Pooling

Next, a pooling layer is used to aggregate the data spread across the nodes of the graph being processed. We tested the three simple options below.

  • 1. Mean Pooling

  • 2. Max Pooling

  • 3. Sum Pooling

Finally, a log-softmax function is applied.
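
Putting the pieces of Sections 2.1 to 2.3 together, the classical portion of the model can be sketched with PyTorch Geometric as below. The GCN layer type, layer count, hidden width, and class name are placeholders standing in for values chosen during hyperparameter tuning.

```python
# Sketch of the classical part of the model: an optional embedder, a stack of
# message-passing layers, a pooling layer, and a final log-softmax.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class GraphClassifier(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, num_classes,
                 num_layers=3, dropout=0.5, embedder=None):
        super().__init__()
        self.embedder = embedder  # None, an MLP, or a QFE wrapper
        dims = [in_channels] + [hidden_channels] * (num_layers - 1) + [num_classes]
        self.convs = torch.nn.ModuleList(
            [GCNConv(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])]
        )
        self.dropout = dropout

    def forward(self, x, edge_index, batch):
        if self.embedder is not None:
            x = self.embedder(x)                 # per-node feature embedding
        for conv in self.convs[:-1]:
            x = F.relu(conv(x, edge_index))
            x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.convs[-1](x, edge_index)        # per-node class scores
        x = global_mean_pool(x, batch)           # aggregate to one vector per graph
        return F.log_softmax(x, dim=-1)
```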

2.4 Training

During initial training runs, we tried both cross-entropy loss, which was used by Xu et al.,3 and Negative Log Likelihood (NLL) loss, which was used in PyTorch Geometric9 examples and in HGP-SL,8 the classical model which achieved the SOTA accuracy on the PROTEINS dataset. Both losses produced similar results; we chose to use NLL loss for its popularity.

To optimize the parameters of the entire model (including the QFE embedder, when applicable), we used Adam.12

To mitigate overfitting, we implemented early stopping with a patience of 30 epochs and a maximum training duration of 200 epochs. The metric used for early stopping is the loss taken from the validation dataset. We also employed dropout, the rate of which is a hyperparameter of the model.
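
A simplified version of the resulting training loop is sketched below, assuming the model interface from the sketch in Section 2.3; device placement and the minimum-duration rule described in Section 2.6 are omitted.

```python
# Sketch of the training loop: NLL loss, Adam, and early stopping on the
# validation loss with a patience of 30 epochs and a cap of 200 epochs.
import copy
import torch
import torch.nn.functional as F

def train(model, train_loader, val_loader, lr=1e-3, weight_decay=0.0,
          max_epochs=200, patience=30):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    best_val, best_state, stale_epochs = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            out = model(batch.x, batch.edge_index, batch.batch)
            F.nll_loss(out, batch.y).backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(
                F.nll_loss(model(b.x, b.edge_index, b.batch), b.y).item()
                for b in val_loader
            )
        if val_loss < best_val:
            best_val, best_state, stale_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # early stopping

    model.load_state_dict(best_state)
    return model
```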

2.5 Hyperparameter tuning

For each embedder option, we optimized our hyperparameters by running a multi-objective Optuna13 study that used the Non-dominated Sorting Genetic Algorithm II (NSGA-II) algorithm.14 Each study included approximately 200 experiments. Their objectives were to …

  • maximize average validation accuracy,

  • minimize the variability of the validation accuracy (based on a 95% confidence interval),

  • minimize the number of trainable parameters.

The hyperparameters we optimized are listed below.

  • The number of QFE layers (if applicable)

  • The message passing layer type

  • The number of layers and hidden channels in the message-passing model

  • The dropout rate

  • The Adam optimizer’s learning rate and weight decay

We also optimized the batch size during our initial experiments. However, since this hyperparameter was shown to have low importance (using the fANOVA evaluation method15) and since the varying memory usage it caused made it difficult to run experiments in parallel, we ultimately chose to hold it constant at 1024.

To increase the stability of each experiment within the Optuna13 study and obtain an estimate of variability, we performed stratified five-fold cross-validation on the provided training dataset and then bootstrapped the accuracies to obtain a 95% confidence interval.
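
The study setup can be sketched with Optuna as follows. The search-space ranges are placeholders, and evaluate() is a hypothetical stand-in for our real evaluation routine (stratified five-fold cross-validation plus bootstrapping, as described above).

```python
# Sketch of the multi-objective Optuna study using the NSGA-II sampler.
import random
import optuna

def evaluate(params):
    """Placeholder for the real pipeline: run stratified five-fold CV and return
    (mean validation accuracy, 95% CI width, trainable parameter count)."""
    return random.random(), 0.1 * random.random(), random.randint(1_000, 300_000)

def objective(trial):
    params = {
        "n_qfe_layers": trial.suggest_int("n_qfe_layers", 1, 5),
        "conv_type": trial.suggest_categorical("conv_type", ["GCN", "GraphConv", "GAT"]),
        "n_mp_layers": trial.suggest_int("n_mp_layers", 2, 6),
        "hidden_channels": trial.suggest_int("hidden_channels", 16, 256),
        "dropout": trial.suggest_float("dropout", 0.0, 0.6),
        "lr": trial.suggest_float("lr", 1e-4, 1e-2, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True),
    }
    mean_acc, ci_width, n_params = evaluate(params)
    return mean_acc, ci_width, n_params  # maximize, minimize, minimize

study = optuna.create_study(
    directions=["maximize", "minimize", "minimize"],
    sampler=optuna.samplers.NSGAIISampler(),
)
study.optimize(objective, n_trials=200)
pareto_front = study.best_trials  # experiments on the Pareto front
```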

2.6 Cross-validation and testing

Stratified five-fold cross-validation was conducted on the entire dataset to control for the bias caused by our train/test data split and to provide multiple test accuracies which could be used to estimate the variability of the test statistic.
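
The outer split can be produced with scikit-learn's StratifiedKFold, as in the sketch below; graph_labels stands in for the vector of graph-level class labels (one entry per graph).

```python
# Sketch of the outer stratified five-fold cross-validation split.
import numpy as np
from sklearn.model_selection import StratifiedKFold

graph_labels = np.random.randint(0, 2, size=1113)  # placeholder labels, PROTEINS-sized

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(np.zeros(len(graph_labels)), graph_labels)):
    # train_idx indexes the graphs used for tuning and training in this fold;
    # test_idx indexes the held-out test graphs.
    print(fold, len(train_idx), len(test_idx))
```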

For each of the train/test data splits provided by cross-validation and for each embedder option, we ran hyperparameter tuning and then selected three of the best models produced, each maximizing one of the utility functions below.

$$U_{\text{BestAll}}(\text{model}) = z_a - z_e - z_p$$
$$U_{\text{BestAccuracy}}(\text{model}) = z_a$$
$$U_{\text{LowParameters}}(\text{model}) = z_a - 3z_p$$
where z_a is the z-score of the model’s validation accuracy, z_e is the z-score of the model’s accuracy variability, and z_p is the z-score of the model’s trainable parameter count. All z-scores [1] are taken with respect to the distribution of experiments at the Pareto front of the corresponding Optuna13 study.
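
In code, the selection step amounts to z-scoring the three objective values over the Pareto-front trials and taking the argmax of each utility, roughly as sketched below (the trial objects are those returned by Optuna's study.best_trials; the function names are ours).

```python
# Sketch of selecting the three "best" models from a Pareto front of trials.
import numpy as np

def zscores(values):
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def select_models(pareto_trials):
    # trial.values = (mean validation accuracy, CI width, parameter count)
    z_a = zscores([t.values[0] for t in pareto_trials])
    z_e = zscores([t.values[1] for t in pareto_trials])
    z_p = zscores([t.values[2] for t in pareto_trials])

    utilities = {
        "BestAll": z_a - z_e - z_p,
        "BestAccuracy": z_a,
        "LowParameters": z_a - 3 * z_p,
    }
    return {name: pareto_trials[int(np.argmax(u))] for name, u in utilities.items()}
```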

To decrease the storage and maintenance burden during hyperparameter tuning and because each model only took a few minutes to train [2], we chose not to save model weights. Instead, we retrained each of the selected models after tuning, this time using a stratified shuffle split to divide the training dataset into a training dataset and a validation dataset (for use during early stopping). To decrease the non-deterministic effects of our training script on the results, we also implemented a minimum training duration of 60 epochs. If early stopping was triggered before then (which likely indicates that training had failed to produce a well-performing model), then the model would be retrained. This could happen up to five times before the training script progressed to the next model.

Finally, we evaluated the selected models from each fold on their corresponding test dataset and then used bootstrapping to calculate a 95% confidence interval for the mean test accuracy.
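
The confidence-interval calculation itself is straightforward with SciPy; the fold accuracies below are illustrative values only.

```python
# Sketch of bootstrapping the per-fold test accuracies into a 95% CI for the mean.
import numpy as np
from scipy.stats import bootstrap

fold_accuracies = np.array([0.70, 0.72, 0.68, 0.73, 0.71])  # illustrative values only

result = bootstrap((fold_accuracies,), np.mean, confidence_level=0.95,
                   n_resamples=10_000, method="percentile")
print(result.confidence_interval)  # (low, high) bounds on the mean test accuracy
```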

3. Experiments

To evaluate our models, we used the same two protein-related graph datasets used by Xu et al.3: PROTEINS5,6 and ENZYMES.5,7 Both of these datasets are a part of the TUDataset collection16 and are publicly available on http://www.graphlearning.io.

The graphs contained in these datasets represent the structures of proteins, and the nodes within these graphs represent secondary protein structures: helices, sheets, and turns. Two nodes are connected “if they are neighbors along the amino acid sequence or one of three nearest neighbors in space”.16 In addition to the secondary structure type, which is represented in the dataset using one-hot encoding, nodes have feature(s) representing some of the structure’s physical and/or chemical properties. The PROTEINS dataset has one such feature; the ENZYMES dataset has 18. Because our computer could not simulate the 21 qubits (three one-hot encoded categories + 18 additional features) that would be required for the ENZYMES dataset, we used Principal Component Analysis (PCA) to reduce the node features to four dimensions. Despite this large reduction, the resulting features explain 99.3% of the variance in the original features.
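
Both datasets can be loaded through PyTorch Geometric, and the dimensionality reduction is a standard scikit-learn PCA fit on the pooled node features; the sketch below reflects our understanding of the preprocessing rather than an exact excerpt of our code.

```python
# Sketch of loading PROTEINS/ENZYMES and reducing ENZYMES node features to 4 dims.
import torch
from sklearn.decomposition import PCA
from torch_geometric.datasets import TUDataset

# use_node_attr=True appends the continuous attributes to the one-hot node labels.
proteins = TUDataset(root="data", name="PROTEINS", use_node_attr=True)  # 3 + 1 = 4 features
enzymes = TUDataset(root="data", name="ENZYMES", use_node_attr=True)    # 3 + 18 = 21 features

# Fit PCA on all ENZYMES node features pooled across graphs and keep 4 components.
all_features = torch.cat([graph.x for graph in enzymes], dim=0).numpy()
pca = PCA(n_components=4).fit(all_features)
print(pca.explained_variance_ratio_.sum())  # ≈0.993, matching the figure reported above

reduced_features = [torch.tensor(pca.transform(graph.x.numpy()), dtype=torch.float)
                    for graph in enzymes]
```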

In PROTEINS, proteins are classified as enzymatic or non-enzymatic. In ENZYMES, proteins (which are all enzymatic) are classified according to the class of the reaction they catalyze; there are six such classes. For both of these datasets, our goal is to predict the graph-level classifications based on the nodes’ features and the graph’s structure as a whole.

Both datasets are quite small; PROTEINS contains 1113 graphs and ENZYMES contains 600. The graphs in both datasets are mid-sized: in PROTEINS, they have an average of 39.06 nodes and 72.82 edges, and in ENZYMES, they have an average of 32.63 nodes and 62.14 edges. A few examples from PROTEINS and ENZYMES are depicted in Figures 3 and 4 respectively.


Figure 3. A few graphs from the PROTEINS dataset.


Figure 4. A few graphs from the ENZYMES dataset.

4. Results and Discussion

This section focuses on notable examples from and visual representations of our results. Our complete results data, optimized hyperparameters, model weights, and code have been published on Zenodo17 under the Creative Commons Attribution 4.0 International Public License (CC BY 4.0). Our code is also available on GitHub at https://github.com/jsimonrichard/QFE-Experiments under the MIT License.

4.1 PROTEINS

Using UBest Accuracy, we found models with average test accuracies on the PROTEINS dataset ranging from 0.68 to 0.72. In particular, we achieved an average accuracy of 0.68 using the QFE-probs embedder and an average accuracy of 0.71 using the QFE-exp embedder, both of which are comparable to the results achieved by Xu et al.3 However, we also achieved an average accuracy of 0.71 using the MLP-D embedder and an average accuracy of 0.72 using no embedder, which are comparable to the QFE accuracies and significantly higher than the MLP-D accuracies produced by Xu et al.3

The average sizes of our models range from 111K to 320K trainable parameters. Notably, the average number of trainable parameters used by our MLP-D-based models is lower than the average for our QFE-based models (though not significantly lower), even though the MLP-D models achieve a higher average accuracy. The results for all five embedder types are depicted with error bars (α=0.05) in Figure 5.


Figure 5. Average accuracies and sizes of the models found with UBest Accuracy on PROTEINS, plotted together with error bars (α=0.05).

Unfortunately, Xu et al. did not report their models’ total sizes, so we cannot make direct size comparisons between our models and theirs. They did report the number of floating-point operations (FLOPs) required for the MLP-2D embedder to process a graph with 40 nodes and the number of quantum gate operations required for their QFE embedder to process a graph of the same size.3,18 However, these metrics are not relevant to our analysis since the sizes of the embedders are overshadowed by the variance in the sizes of the message-passing layers.

The models found using UBestAll performed almost as well as those found with UBest Accuracy, but with an order of magnitude fewer parameters. They achieve accuracies ranging from 0.68 to 0.71 with average parameter counts ranging from 13K to 43K. Their results are depicted in Figure 6.


Figure 6. Average accuracies and sizes of the models found with UBestAll on PROTEINS, plotted together with error bars (α=0.05).

Finally, we found models using ULowParameters that maintain that performance with even fewer parameters. Their average accuracies range from 0.69 to 0.72 and their average sizes range from 3K to 11K trainable parameters. Notably, this includes the models using MLP-2D, which maintain an average accuracy of 0.69 using an average of 3K trainable parameters. The results for all five embedder types are shown in Figure 7.


Figure 7. Average accuracies and sizes of the models found with ULowParameters on PROTEINS, plotted together with error bars (α=0.05).

4.2 ENZYMES

Our results from ENZYMES represent an even more drastic departure from the results reported by Xu et al. In fact, every single average taken from the models using MLP-D or no embedder is significantly greater than the highest MLP-D accuracy that Xu et al. achieved [3].

The models found using UBest Accuracy achieve average accuracies ranging from 0.32 to 0.46 with average parameter counts ranging from 124K to 299K. Notably, the models with no embedder achieved the highest average accuracy (0.46) with an average of 269K trainable parameters. This average is significantly greater than all of the accuracies reported by Xu et al. The results for all five embedders are depicted in Figure 8.


Figure 8. Average accuracies and sizes of the models found with UBest Accuracy on ENZYMES, plotted together with error bars (α=0.05).

Again, most of the models found using UBestAll achieve similar average accuracies (although the highest average is no longer 0.46) using an order of magnitude fewer parameters. Their average accuracies range from 0.26 to 0.35 and their average parameter counts range from 18K to 46K. These results are shown in Figure 9.


Figure 9. Average accuracies and sizes of the models found with UBestAll on ENZYMES, plotted together with error bars (α=0.05).

Finally, the models found by ULowParameters achieve similar results with marginally fewer parameters for some and essentially the same number of parameters for others. They achieve accuracies ranging from 0.27 to 0.40 with sizes ranging from 3K to 48K trainable parameters. These results are shown in Figure 10.


Figure 10. Average accuracies and sizes of the models found with ULowParameters on ENZYMES, plotted together with error bars (α=0.05).

5. Conclusion

Our analysis casts doubt on QFE’s efficacy in the real world despite its interesting mathematical foundations. Using the same datasets as Xu et al.,3 we achieved similar and, in some cases, significantly better performance compared to the accuracies reported by Xu et al.3 using models with QFE, models with the MLP-D embedder, and models with no embedder at all. In addition, we showed that it is possible to find classical models with comparable performance that have fewer trainable parameters than their quantum-classical hybrid counterparts. Why did Xu et al. fail to reach these conclusions? There may have been a few contributing factors.

  • They overlooked classical models with no external embedder.

  • They probably did not use the same hyperparameter tuning method we did (they make no mention of tuning methods in their article3).

In addition, they did not provide error bars, so it is hard to know whether their results are statistically significant.

Our study highlights the critical importance of careful experimental design, statistical testing, and thorough hyperparameter tuning. Without these things, it is difficult if not impossible to draw sound conclusions in a rapidly evolving field like QGL.

5.1 Limitations and future research

There are several limitations (and corresponding future research directions) to this study.

  • Because of the small sizes of both the PROTEINS and ENZYMES datasets, it is difficult to show statistically significant differences. We were able to do so in a few cases, but most comparisons that could be made are insignificant. Future research may include evaluating QFE on larger datasets.

  • It is still unclear why QFE is failing to provide an advantage in either accuracy or model size. One possibility is that the QFE embedder does not have much information to work with because it only operates on individual node features, which are fairly one-dimensional (in both PROTEINS and ENZYMES, the first component generated by PCA explains over 95% of the variance). QFE may still be useful in other contexts.

Implementation details

Our code can be found on GitHub: https://github.com/jsimonrichard/QFE-Experiments. The main libraries we used are listed below; thanks go out to all of their authors and maintainers.

  • Classical NN Framework: PyTorch19

  • Quantum Framework: Pennylane20

  • Dataset Management and Prebuilt Classical Models: PyTorch Geometric9

  • Cross-Validation: scikit-learn21

  • Stats Calculations: SciPy22

  • Hyperparameter Tuning: Optuna13

In addition, our training script was largely inspired by the training script written for HGP-SL8 (located at https://github.com/cszhangzhen/HGP-SL).
