Large language models often know when they are being evaluated