We present Adam, an algorithm designed for first-order gradient-based optimization of stochastic objective functions, 
leveraging adaptive estimates of lower-order moments. This approach is easy to implement, computationally efficient, 
and requires minimal memory. It is also invariant to diagonal rescaling of gradients and is particularly suitable for problems 
with large datasets and/or parameter spaces. The method is also well suited to non-stationary objectives and to problems 
with very noisy and/or sparse gradients. The hyperparameters have intuitive interpretations and typically require little tuning. 
We discuss connections to related algorithms that inspired Adam. We also analyze the algorithm's theoretical convergence 
properties and provide a regret bound on the convergence rate that is comparable to the best-known results 
in the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and 
compares favorably to other stochastic optimization methods. Finally, we introduce AdaMax, a variant of Adam based on the infinity norm.
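As a concrete illustration of the adaptive moment estimates mentioned above, the following is a minimal NumPy sketch of a single Adam step. The function name `adam_step` and the update layout are illustrative conventions, not the paper's pseudocode verbatim; the default hyperparameters shown here are the commonly used ones.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative Adam update.

    m and v are exponential moving averages of the gradient and the squared
    gradient (the first and second raw moments); t is the 1-based step count.
    """
    m = beta1 * m + (1 - beta1) * grad        # update biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # update biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-correct the first moment
    v_hat = v / (1 - beta2 ** t)              # bias-correct the second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # parameter update
    return theta, m, v

# Example: minimize f(x) = x^2 starting from x = 1 (gradient is 2x).
theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
```

Note that the per-step update magnitude is roughly bounded by the step size `lr`, which is one sense in which the hyperparameters have intuitive interpretations.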