My favorite function of all time is varsoc in Stata. That’s saying a lot because I have been working with computers for decades and have written software in several languages, used many different types of administrative software tool sets, and owned a lot of books with code in them. Varsoc regresses one variable, y, upon another variable, x, and then regresses each lag of y on x to produce output that allows one to know the best fit lag for a regression model. It allows someone analyzing time series data to immediately know that data from the several prior is a better predictor of today’s reality than more recent data. I adore Stata for scientific analysis. In order to use this for my big data project, I needed to automate it, and so I wrote an R vignette that would analyze 45 lags and produce the relevant test statistics. My vignette produces r^{2} values^{1}, parameter estimates, and f-statistics for 45 lags of y regressed on x. The p-values are then written to a CSV file. The decision rule for a p-value is that we reject the null hypothesis if the p-value is less than or equal to α/2.^{2} The data comes from 5GB of CSV files that were created via Python.

Running the lags shows us the relationships between the historical prices of two securities. When we regress y on x in this case, we are regressing the price of security 2 on security 1. We then do this on a lag. The L1 of security 2 regressed on security 1’s L0. Then we regress L2 of security 2 on security 1’s L0. This occurs for 45 iterations. For example, we might find that the price of a gold ETF 44 days ago has the best relationship with the price of Apple stock today as compared to the price of that same gold ETF 12 days ago and even today. That’s an example only and not anything substantiated in the data. There will certainly be some spurious relationships. An ETF buying shares of Apple and then the same ETF’s fee going up the next month, for example. To mitigate this, the vignette uses the first difference of the logarithm so that the data is stationary. The CSVs are already produced so that unit roots are accounted for. This is a research project to identify what actually bodes well in other sectors. It runs on every listed security on the American exchanges. Every symbol is regressed on Apple. Every symbol is regressed on Microsoft, and so on. The data is stationary and unit roots are eliminated.

I initially began this project some time ago and at that time I stopped because it was going to take a solid month of continuous 12-core processing to accomplish the entire series. In retrospect, I should have let that proceed but there would have been a great tradeoff in that I couldn’t have played Roblox, The Isle, and Ark Survival Evolved with my daughter. Finally, I’ve got the research running on a new machine dedicated to that purpose. That machine uses an AMD Ryzen 5 3500 and NVMe SSD. The program is running on 6 cores in parallel. Previously, with the one month estimate, it was running concurrently on 12-cores of Westmere Xeon CPUs and storing the output in RAM instead of on an SSD. This will serve as an interesting test for the Ryzen since all six cores will be running at 100% for months on end. The operating system is OpenSuse Leap 15.2, the R version is 4.05, and the Python version is 2.7.18.

One of the reasons to write these articles is for my own memory. It gets older to remember as one gets older. These blog posts are essentially a public notebook to aid myself and others.

^{1} R^{2} is the coefficient of determination, which is the square of the Pearson correlation coefficient, r, the formula for which is ρ=β_{1}(σx/σy), where β_{1} is the parameter estimate. ASCI and Unicode text does not have a circumflex, ^, on top of the β. For this documentation the objective is multiplatform long-term readability so an equation editor with specialized support for circumflexes is out of the question.

^{2} There is also the existence of the rejection region method. We reject the null hypothesis if the test statistic’s absolute value is greater than the critical value, which we can express with the formula *Reject if *|t| > t_{α/2,n-1}