OpenAI? How about HuggingFace.co?

Wakeem's World
6 min read · Feb 18, 2023
HuggingFace giving OpenAI a piggyback ride 👨‍👦 (mostly generated by DALL-E-2)

AI appears to be reaching peak hype, much like crypto/NFTs were a year ago, so I thought I would share a really great resource I have been using for learning, building, and training AI models.

This article builds the AI Writer Medium account, which also posts the same articles to its own website, if you wanna check out the end result.

huggingface.co

Huggingface.co is a site dedicated to “democratize good machine learning, one commit at a time.”

Hugging Face is taking a slightly different approach to publishing AI models than OpenAI: you can start with a really bare-bones model, fine-tune it, then share it.

I think tools like hugging face will be much more influential in progressing the AI revolution because they appear to be more focused on sharing models so other people can build other models off of a model you had started to train.

Free use endpoints

In my opinion one of the coolest features they have is the free test endpoints for some pre-trained models. These let you try out a model before training to see what it does. This is really awesome for models like the bart-large-cnn model that can produce decent summaries out of the box. It is also nice for people, like myself, who are new to AI tools and don’t really know where to start. 😅

You can also easily train your own models using custom data via their auto-train site. It is still in beta but basically it shows you what kind of data you might need to train a specific model and you can supply your own data to fine tune an existing model. Neat right?!

Paid for endpoints

Inference Endpoint Deployment Page

If you need a more robust endpoint for a specific model they also give you one click deployment options so you can deploy a model to a dedicated endpoint that you can configure with your own authentication or make completely public.

They also have tight integration with AWS SageMaker so you can run your training or deployment in your own AWS account if you don’t want someone else to manage your endpoints.

Let's build something

I am talking up hugging face quite a bit so to prove how cool this site is I am going to quickly create a web app that will create a daily summary of trending ABC News articles.

Project description

I think we can break this project down into a simple step function state machine that looks something like this 👇

Article Summarizing State Machine
  1. Fetch articles from ABC News Feed (Skipping to the end if we have already generated articles for that day)
  2. Extract Article text
  3. Map over each article text and submit the article to a hugging face pre-trained model
  4. Finally, publish articles to Medium and maybe a custom site deployed to Vercel depending on how adventurous I am feeling 😅
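
Before wiring this into a real state machine, the four states above can be sketched as a plain async function. This is only a rough outline under some assumptions: the `PipelineSteps` interface and every name in it are hypothetical stand-ins for whatever runs behind each state, the `SummarizedArticle` shape is a guess, and `Promise.all` stands in for the map step.

```typescript
// Hypothetical shapes; none of these names are taken from the real project.
interface ArticleLink {
  link: string;
  tag: string;
}

interface SummarizedArticle {
  link: string;
  tag: string;
  headline: string;
  summary: string;
}

// One function per state. They are injected so the flow can be exercised
// without touching the network.
interface PipelineSteps {
  alreadyPublished: (day: string) => Promise<boolean>;
  fetchArticleUrls: () => Promise<ArticleLink[]>;
  extractText: (link: string) => Promise<string>;
  summarize: (text: string) => Promise<{ headline: string; summary: string }>;
  publish: (day: string, articles: SummarizedArticle[]) => Promise<void>;
}

const runDailyPipeline = async (
  day: string,
  steps: PipelineSteps,
): Promise<SummarizedArticle[]> => {
  // Step 1: skip to the end if this day was already generated.
  if (await steps.alreadyPublished(day)) {
    return [];
  }
  const links = await steps.fetchArticleUrls();

  // Steps 2 and 3: Promise.all plays the role of the map step here.
  const articles = await Promise.all(
    links.map(async ({ link, tag }) => {
      const text = await steps.extractText(link);
      const { headline, summary } = await steps.summarize(text);
      return { link, tag, headline, summary };
    }),
  );

  // Step 4: publish everything in one go.
  await steps.publish(day, articles);
  return articles;
};
```

Keeping each state behind its own function makes it easy to swap a fake in for any step while testing the others.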

Fetch articles

This should be simple enough since ABC has an RSS feed that gives us links to its stories. All we need to do is make an HTTP request and parse the XML to find the links 👇

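// XMLParser is assumed to come from the fast-xml-parser package;
// makeHttpsRequest, rssFeed, and XMLRoot are defined elsewhere in the project.
const getArticleUrls = async () => {
  const xml = await makeHttpsRequest(rssFeed);
  const parser = new XMLParser();
  const json = parser.parse(xml) as XMLRoot;

  // Skip items in the 'Live' category and keep each story's link and category
  return json.rss.channel.item
    .filter((item) => item.category !== 'Live')
    .map((item) => ({
      link: item.link,
      tag: item.category,
    }));
};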

Extract Article text

Examining the HTML structure of an ABC News story page reveals that the article body is marked with a data-testid attribute, and the text itself lives in the paragraph tags underneath it.

I think the data-testid attribute will be reliable because it is very commonly used when testing React apps. So I am guessing it will not change often if they have tests relying on it. 😃

I can also use cheerio to parse and select elements on the page from my node backend so our code should look something like this 👇

// `load` comes from cheerio; getDataFromUrl is a project helper that returns the page HTML.
const abcNewsArticleScrape = async (link: string): Promise<string> => {
  const html = await getDataFromUrl(link);
  const $ = load(html);

  // ABC marks the article body with a data-testid attribute
  const parentElement = $('article[data-testid="prism-article-body"]');

  // Collect the full text of every <p> under the article body. Using .text()
  // also picks up text nested inside links or emphasis tags, which reading
  // only the first child node would miss.
  const paragraphs: string[] = parentElement
    .find('p')
    .map((_, element) => $(element).text().trim())
    .get()
    .filter((paragraph) => paragraph.length > 0);

  return paragraphs.join('\n');
};

Summarize Article

I am going to leverage the step function map step to iterate over all of the article text that I fetched in the previous step. So this state will be looking at a single block of article text.

I also want to just use the free huggingface.co endpoints to generate my summary/headline since this is just a dumb proof of concept project.

So I think I will want to make a POST request with the article text to the bart-large-cnn model for the full summary and to the t5-one-line-summary community model to generate my headlines.
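
To make the shape of that request a bit more concrete, here is a rough sketch of the summary call. It assumes the hosted endpoint URL pattern `https://api-inference.huggingface.co/models/<model-id>`, a JSON body of the form `{ inputs: ... }`, and a summarization response shaped like `[{ summary_text: ... }]`, so check the model page for the current details. The helper names, the injected `PostJson` transport, and the `facebook/` prefix on the model id are my assumptions, not something taken from the project code.

```typescript
// Assumed URL pattern for the hosted endpoints (verify on the model page).
const HF_API_BASE = 'https://api-inference.huggingface.co/models';

interface HfRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Build the POST request for one model. `token` is a personal access token.
const buildSummaryRequest = (modelId: string, text: string, token: string): HfRequest => ({
  url: `${HF_API_BASE}/${modelId}`,
  headers: {
    Authorization: `Bearer ${token}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({ inputs: text }),
});

// Pull the summary out of the assumed response shape. Anything else throws,
// so a retry wrapper can try again (for example while the model is booting).
const parseSummaryResponse = (payload: unknown): string => {
  if (Array.isArray(payload) && payload.length > 0) {
    const first = payload[0] as { summary_text?: unknown } | null;
    if (typeof first?.summary_text === 'string') {
      return first.summary_text;
    }
  }
  throw new Error(`Unexpected response: ${JSON.stringify(payload)}`);
};

// Hypothetical transport: any function that POSTs the request and returns parsed JSON.
type PostJson = (request: HfRequest) => Promise<unknown>;

const getSummarySketch = async (text: string, token: string, post: PostJson): Promise<string> =>
  parseSummaryResponse(await post(buildSummaryRequest('facebook/bart-large-cnn', text, token)));
```

Throwing on an unexpected payload is deliberate: it lets whatever retry logic wraps the call decide whether to try again.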

Since I am using the free endpoints I also want to be sure to use exponential backoff when making the requests as it can sometimes take a while for these endpoints to boot up. Also I want to be a good citizen and not spam their free API endpoints. 😜

So code for this is also very simple 👇

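// backOff retries a failing call with growing delays between attempts;
// maxDelay caps any single wait at 60 seconds (the value is in milliseconds).
const summarize = async (text: string): Promise<{
  summary: string;
  headline: string;
}> => {
  const headline = await backOff<string>(
    () => getHeadline(text),
    { maxDelay: 60000 },
  );
  const summary = await backOff<string>(
    () => getSummary(text),
    { maxDelay: 60000 },
  );

  return {
    headline,
    summary,
  };
};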
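
If you have not used a backoff helper before, here is a hand-rolled sketch of the idea. It is not the library's implementation: `retryWithBackoff`, `backoffDelay`, and the injectable `sleep` are invented for illustration, and only `maxDelay` mirrors the option used above.

```typescript
interface RetryOptions {
  attempts: number; // total tries before giving up
  startingDelay: number; // ms to wait after the first failure
  maxDelay: number; // cap on any single wait, in ms
  sleep?: (ms: number) => Promise<void>; // injectable so tests do not really wait
}

// Delay before retry number `retry` (0-based): doubles each time, then is capped.
const backoffDelay = (retry: number, startingDelay: number, maxDelay: number): number =>
  Math.min(startingDelay * 2 ** retry, maxDelay);

const retryWithBackoff = async <T>(
  task: () => Promise<T>,
  options: RetryOptions,
): Promise<T> => {
  const sleep =
    options.sleep ??
    ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;

  for (let attempt = 0; attempt < options.attempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No point sleeping after the final failed attempt.
      if (attempt < options.attempts - 1) {
        await sleep(backoffDelay(attempt, options.startingDelay, options.maxDelay));
      }
    }
  }
  throw lastError;
};
```

With a starting delay of 100 ms the waits grow as 100, 200, 400, and so on until they hit the cap, which gives a cold endpoint time to boot without hammering it.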

Publishing the articles

I don’t really want to do much work to design/build a website so I am just going to publish these daily summaries to Medium and to a prebuilt Next.js blogging template deployed to Vercel.

Medium
Publishing to Medium is simple. It appears they have a very straightforward API that allows you to publish articles using markdown.

So I just need to make the following HTTP POST request after packaging my summaries as a single markdown string 👇

const publishArticles = async (articles: SummarizedArticle[]) => {
  const title = markdown.getTitle();
  const content = markdown.convertToMarkdown(articles);
  // Await the request so a failed publish surfaces as a failure of this step
  await publishArticleToMediumAPI(title, content);
};
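
The `markdown` helpers are not shown here, so this is one guess at what packaging the summaries could look like. The `SummaryItem` fields, the function names, and the layout are all assumptions for illustration. The `contentFormat` and `publishStatus` fields reflect my understanding of Medium's create-post payload, so double check them against the API docs.

```typescript
// Hypothetical article shape; the real SummarizedArticle type may differ.
interface SummaryItem {
  headline: string;
  summary: string;
  link: string;
}

// Title for the day's post, built from the UTC date.
const getTitle = (date: Date): string =>
  `Daily News Summary: ${date.toISOString().slice(0, 10)}`;

// One "## headline" section per article, with a link back to the source story.
const convertToMarkdown = (articles: SummaryItem[]): string =>
  articles
    .map(({ headline, summary, link }) =>
      [`## ${headline}`, '', summary, '', `[Read the full story](${link})`].join('\n'),
    )
    .join('\n\n');

// Request body for the publish call (field names assumed, see note above).
const buildMediumPost = (title: string, content: string) => ({
  title,
  contentFormat: 'markdown' as const,
  content,
  publishStatus: 'draft' as const,
});
```

Starting with a draft status is a cheap way to eyeball the generated post before letting it publish on its own.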

Vercel Blog
This one is a bit more complicated because the template I am using is generating the site from static .mdx files.

To integrate with these .mdx files I am going to generate the .mdx file and save them to S3 during the publishArticles step.
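
Generating the file is mostly string building. As a sketch, assuming the template reads YAML front matter with `title`, `date`, and `tags` fields (these field names, the `toMdx` helper, and the file naming scheme are my guesses, so check the template you use):

```typescript
interface MdxPost {
  title: string;
  date: string; // ISO date, e.g. "2023-02-18"
  tags: string[];
  body: string; // markdown content
}

// Quote a value for the front matter. A JSON string is also a valid
// double-quoted YAML scalar, so this handles colons and quotes safely.
const yamlString = (value: string): string => JSON.stringify(value);

const toMdx = ({ title, date, tags, body }: MdxPost): string =>
  [
    '---',
    `title: ${yamlString(title)}`,
    `date: ${yamlString(date)}`,
    `tags: [${tags.map(yamlString).join(', ')}]`,
    '---',
    '',
    body,
    '',
  ].join('\n');

// Hypothetical naming scheme for the object key the file is saved under.
const mdxFileName = (date: string): string => `${date}-daily-summary.mdx`;
```

Quoting every front matter value is a little noisy, but it means a headline with a colon in it cannot break the file.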

Next I will set up a scheduled GitHub Action to pull down my S3 files and commit these .mdx files to the repo.

name: Frontend Update
on:
  schedule:
    # Scheduled workflows use UTC: 13:37 UTC is 7:37 AM CST, every day
    - cron: '37 13 * * *'
  workflow_dispatch:

jobs:
  build:
    runs-on: ubuntu-latest
    environment: production
    permissions:
      contents: write
    steps:
      ...
      - name: Fetch Latest Articles
        run: npm run downloadArticles
      - name: Commit Latest Articles
        uses: stefanzweifel/git-auto-commit-action@v4
        with:
          commit_message: "GithubAction: Updated Latest Articles"
          branch: main
          file_pattern: 'frontend/data/blog/*.mdx *.mdx'
          commit_author: Author <actions@github.com>
          skip_dirty_check: false
          skip_fetch: false
          skip_checkout: false

Finally, I will set up my frontend app against Vercel so the commit will then trigger Vercel to rebuild/deploy the site. 🤖

Final custom site preview with AI articles 🤖

Code

As per usual all of the code is available in this GitHub repo if you would like to deep dive into this project 👇

Don’t forget to follow AI Writer to view this project's daily news summaries and visit the custom blog site here. 😎

I really want to learn more about training and deploying my own models to Hugging Face, so maybe in my next article I will create my own custom model and try it out in Spaces or something. 😛

Let me know what you think about this project and if you have any ideas on things I should try out with huggingface.co. 🤖

Wakeem's World

Attempting to weekly blog about whatever I am thinking about. Skateboarding, tech, film, life, etc.