Research on Query Performance Prediction (QPP) focuses on estimating the effectiveness of retrieval results in the absence of human relevance judgments. Accurately estimating the quality of a search performed in response to a query has been studied extensively over the past two decades. With the rising popularity of virtual assistants and the evolving research on complex information needs, both the need for reliable QPP methods and the number of potential applications have grown significantly. In this work, we focus on improving the evaluation framework of QPP. Since we view the existing evaluation methodology as a considerable limitation on the improvement of QPP methods, a more reliable evaluation framework would constitute a stepping-stone toward a breakthrough in QPP. The existing evaluation framework mainly relies on measuring the correlation coefficient between the per-query prediction scores and the actual per-query system effectiveness, usually measured by Average Precision (AP); the QPP method that achieves the higher correlation is considered superior. However, Hauff et al. demonstrate that a higher correlation does not guarantee more accurate prediction. The authors additionally advocate the use of Fisher’s 𝑧 transformation and Confidence Intervals (CIs) to determine whether differences between multiple correlation coefficients are statistically significant. Furthermore, conclusions drawn with the existing evaluation methodology hold only for a specific combination of a corpus, retrieval method, and set of queries, and do not necessarily carry over if any of these components is changed. That is, the existing evaluation is not agnostic to these components, and hence any conclusions about the relative prediction quality of QPP methods should be taken with a grain of salt.
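For completeness, we recall the standard formulation of Fisher’s 𝑧 transformation referred to above; this is a textbook result rather than a contribution of this work. A sample correlation coefficient $r$ computed over $n$ queries is transformed as
\[
z = \operatorname{arctanh}(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r}, \qquad \mathrm{SE}(z) \approx \frac{1}{\sqrt{n-3}},
\]
so that an approximate $95\%$ CI for the correlation is $\tanh\bigl(z \pm 1.96/\sqrt{n-3}\bigr)$. Two correlation coefficients estimated over independent query samples of sizes $n_1$ and $n_2$ can then be compared via the statistic $(z_1 - z_2)/\sqrt{1/(n_1-3) + 1/(n_2-3)}$, which is approximately standard normal under the null hypothesis of equal correlations.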