Inspired by 2021 in titles by Jason Werner (whom I found by clicking the "next" link on the Indieweb ring, as I enjoy doing), I decided to have a play at my own word cloud for 2021. Rather than just grabbing all the titles for the year and throwing them at a website, I had to spice it up a bit.
Step 1: Dockerised https://github.com/amueller/word_cloud
The wordcloud tool Jason used is amueller/word_cloud - a pretty great wordclouder with templates and all sorts of power. I've been running almost every new tool in its own Docker container these days to try not to pollute my own environment, so I threw together a quick container (again, inspired by the work on Dockerised Puppeteering).
FROM python:3
RUN mkdir /wordcloud
WORKDIR /wordcloud
RUN pip install wordcloud
Simple and effective - grab the latest Python 3 image, create a /wordcloud working directory, and pip install wordcloud. Job done.
Step 2: Harvest the content from all my posts for 2021
This took me a while. I thought sed could pattern match my way through a find from the subdirectory... but sed works line by line, so that didn't work. I looked at other tools like csplit before googling my way to remembering "oh wait, this is what grep does." Reading through grep --help revealed a lot, and this post on multiline grep and the -P flag gave me the last little bit I needed.
> grep -hPor --include=*.md "(?s)===(.*)" DIRECTORY/posts/*/2021 | pandoc -t plain > output/input.txt
-h: hide the file name from the output
-P: use Perl-compatible regular expressions (PCRE), so I can add the (?s) flag, which lets . match newlines for multiline matching
-o: print only the matched text rather than the whole line
-r: recurse through the directories
--include=*.md: only look at Markdown files in the directory
"(?s)===(.*)": my files use === as the Markdown preamble separator, and I'm sure not to use it anywhere else in the file so my parser doesn't break. So I can tell grep to only bring back everything from the === onwards.
DIRECTORY/posts/*/2021: start at the posts/*/2021 directories and bring back all the matching files
|: pass the output of that command to the next command
pandoc: this is the next command, my Markdown converter
-t plain: output plain text, to remove all the formatting and be as close to just my words as possible
> output/input.txt: send the resulting text to the input.txt file
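If grep -P isn't available (the BSD grep on macOS lacks it, for example), the same harvest can be sketched in Python. This is a minimal sketch rather than the method used above: body_after_separator and collect_bodies are hypothetical names, it assumes each post is a UTF-8 .md file somewhere under a 2021 directory, and unlike grep -o it drops the === itself. The result would still need a pass through pandoc to strip the Markdown formatting.

```python
from pathlib import Path


def body_after_separator(markdown, separator="==="):
    """Return the text after the first separator, or "" if there is none."""
    _, found, body = markdown.partition(separator)
    return body.strip() if found else ""


def collect_bodies(root, year="2021"):
    """Join the bodies of every .md file that sits under a <year> directory.

    Assumption: the year appears as its own path component, for example
    root/<section>/2021/post.md.
    """
    bodies = []
    for path in sorted(Path(root).rglob("*.md")):
        if year in path.relative_to(root).parts:
            text = path.read_text(encoding="utf-8")
            bodies.append(body_after_separator(text))
    # Skip files that had no separator (and therefore no body).
    return "\n\n".join(body for body in bodies if body)
```

Calling collect_bodies("DIRECTORY/posts") and writing the result to a file would stand in for the grep half of the pipeline.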
> docker run -i --rm -v THISDIR\output:/wordcloud/output -t wordcloud wordcloud_cli --text output/input.txt --imagefile output/mep.png
docker run: let's run a prebuilt image
-i: interactive mode (keeps STDIN open) so we can stop when we want, if we want
--rm: delete the container once it's finished running, so we don't leave stuff lying about
-v THISDIR\output:/wordcloud/output: mount the local output directory into the container so it can read the input file and write the image back out
-t: allocate a terminal (TTY) for the container
wordcloud: this is the image we want to run, the one we built earlier
wordcloud_cli --text output/input.txt --imagefile output/mep.png: run the wordcloud_cli command over the text we made earlier, giving us the image we want!
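A word cloud is essentially a picture of word frequencies, so a rough text-only preview of input.txt is possible before rendering anything. The sketch below is an approximation, not what wordcloud_cli does internally: top_words is a hypothetical helper, and its stopword list is a tiny made-up stand-in for the much longer list a real tool would use.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "i", "it", "in", "is", "my"}


def top_words(text, limit=10):
    """Return the most common non-stopword words as (word, count) pairs."""
    # Lowercase first so "Docker" and "docker" are counted together.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 1)
    return counts.most_common(limit)
```

Reading output/input.txt and passing its contents to top_words gives a quick list of the words most likely to end up largest in the picture.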
And it works, producing output/mep.png:
I can also do it just for my titles, like Jason did:
> grep -ihPor --include=*.md "Title:.*" DIRECTORY/posts/*/2021 | pandoc -t plain > output/title.txt
> docker run -i --rm -v THISDIR\output:/wordcloud/output -t wordcloud wordcloud_cli --text output/title.txt --imagefile output/title.png
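Because grep -o prints the whole match, the text passed on to pandoc still includes the Title: label itself. A hedged Python sketch of pulling out just the title text follows; extract_title is a hypothetical helper, and it assumes the preamble has a line that begins with Title: (in any letter case), which is stricter than the unanchored grep pattern.

```python
import re

# Match a line that starts with "Title:" (any case) and capture the rest.
TITLE_LINE = re.compile(r"^title:[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE)


def extract_title(markdown):
    """Return the first Title: value in the text, or None if there is none."""
    match = TITLE_LINE.search(markdown)
    return match.group(1).strip() if match else None
```

Running this over each post and joining the results with newlines would give a title.txt that holds only the titles.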
I like coding, comedy, games (computer and tabletop RP), and steampunk. From the clouds, it's very obvious I've been focusing on the coding. Not sure if that's because it's so much easier during the pandemic to fire up the code windows rather than trying to steampunk things. I think I need to make an effort this year to broaden my hobby time.
Yes, I now recognise the irony, thank you.