Tapered Off-Policy REINFORCE: Stable and Efficient Reinforcement Learning for LLMs