I was wondering why you used BFGS optimization instead of one of the built-in optimizers in PyTorch, such as Adam or gradient descent.