Typically these single-word embedding visualizations work much better with non-contextual models such as the more traditional gensim word2vec approach, since contextual encoder-based models like BERT don't 'bake' as much meaning into the token (word) itself, relying instead on the surrounding context to define it.
Also, PCA on contextual models like BERT often ends up with the first principal component ($PC_0$) aligned with document length rather than anything semantic.
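For the static-embedding case, a minimal sketch of what such a visualization pipeline looks like: project word vectors to 2D with PCA and read off scatter-plot coordinates. The vectors here are random toy stand-ins (not real word2vec output), and the gensim loading path shown in the comment is just one common way to get real vectors.

```python
import numpy as np

# Toy stand-ins for static word vectors. With real data you might instead do:
#   import gensim.downloader as api
#   kv = api.load("glove-wiki-gigaword-50")
#   vectors = np.stack([kv[w] for w in words])
rng = np.random.default_rng(0)
words = ["king", "queen", "apple", "banana"]
vectors = rng.normal(size=(len(words), 50))  # hypothetical 50-d embeddings

# PCA via SVD: center the vectors, decompose, keep the top two components
centered = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T  # shape (n_words, 2): x/y for a scatter plot

for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```

With a contextual model you would instead have one vector per token occurrence, which is where the document-length artifact in $PC_0$ tends to show up.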