Custom Image Dataset

Background:

When building a model, data collection is one of the most essential parts of the machine learning lifecycle. The internet is full of ready-made datasets, complete with class names and labels, that can be used to build and train a model. But what happens when you want to create a custom dataset with custom classes? Many beginners ask this question, and I was one of them. This is where data scraping comes in.

What is data scraping? According to Wikipedia, data scraping is a technique in which a computer program extracts data from human-readable output coming from another program. Web scraping is one form of data scraping, and the data it collects can be text, images, video or audio. Many tools on the web help with scraping, such as Selenium, Beautiful Soup and many more.
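To make the idea concrete, here is a minimal sketch of the extraction step in image scraping: pulling image URLs out of a page's HTML. It uses only Python's standard library `html.parser` rather than Selenium or Beautiful Soup, so it runs without extra installs. The class name `ImageSrcParser`, the helper `extract_image_urls` and the sample HTML are made up for illustration and do not come from any of the tools mentioned.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return the absolute URLs of all images found in the HTML string."""
    parser = ImageSrcParser(base_url)
    parser.feed(html)
    return parser.image_urls


# Hypothetical page content, used only for illustration.
sample_html = (
    '<html><body>'
    '<img src="/cats/1.jpg">'
    '<img src="https://example.com/dogs/2.png">'
    '<p>no image here</p>'
    '</body></html>'
)
print(extract_image_urls(sample_html, "https://example.com/gallery"))
```

A real scraper would still need to fetch the page and download each URL, and many image search pages load their results with JavaScript, which is one reason browser-driving tools like Selenium exist.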

Focus: Image scraping

As I mentioned earlier, there are many web scraping tools, such as Selenium and Beautiful Soup, as well as custom scraping scripts that are simple to run: clone the repository, run setup.py and voila, you can start scraping images from the web. But what happens when none of these techniques work, as in my case?

My search led me to a beautiful article that pointed to Deliton Junior, who developed an image dataset tool, IDT. IDT came in handy for scraping images and creating the custom classes/labels for a personal project that I am currently working on.

According to Deliton, Image Dataset Tool (idt) is a CLI tool designed to make the otherwise repetitive and slow task of creating image datasets into a fast and intuitive process. The repository of this tool can be found here.

Pros:

A. It is easy to use.

B. It is fast at scraping and downloading image data.

C. After scraping, it can split your dataset using the percentages you choose.
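To show what a percentage split like the one in point C involves, here is a minimal sketch. It is not IDT's own implementation. The function name `split_dataset`, its parameters and the file names are assumptions made for illustration.

```python
import random


def split_dataset(items, train_percent, seed=0):
    """Shuffle items and split them into (train, validation) lists."""
    if not 0 <= train_percent <= 100:
        raise ValueError("train_percent must be between 0 and 100")
    shuffled = list(items)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for one scraped class.
images = [f"cat_{i:03d}.jpg" for i in range(10)]
train, valid = split_dataset(images, train_percent=80)
print(len(train), len(valid))
```

Splitting each class separately, rather than the whole dataset at once, keeps the class proportions the same in both parts.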

Cons:

A. You will have to filter your dataset manually, which is time consuming, because some of the keywords you choose may return images unrelated to the ones you want.

B. When you use the second scraping option, Bing, the terminal freezes even though the scraping is 100% complete.

C. It is not optimal for scraping large amounts of data.

D. It can leave the scraped dataset imbalanced, with more images in some classes than in others.

If you are new to machine learning and want to work on your computer vision skills with custom classes and a small amount of data, I suggest you use IDT.

Data imbalance can be addressed with a variety of techniques, such as SMOTE or changing the algorithm. Tara Boyle wrote an article about solving this problem, which can be found here.
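As a rough illustration of rebalancing, the sketch below counts the images per class and then applies plain random oversampling, which duplicates randomly chosen minority-class samples. This is a simpler technique than SMOTE, which creates synthetic samples instead of copies. The function name `oversample` and the file names are hypothetical.

```python
import random
from collections import Counter


def oversample(samples, seed=0):
    """Duplicate random minority-class samples until all classes match the largest.

    samples is a list of (file_name, label) pairs.
    """
    if not samples:
        return []
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw with replacement to fill the gap up to the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical scraped dataset: 6 cat images but only 2 dog images.
dataset = [(f"cat_{i}.jpg", "cat") for i in range(6)]
dataset += [(f"dog_{i}.jpg", "dog") for i in range(2)]

print(Counter(label for _, label in dataset))
balanced = oversample(dataset)
print(Counter(label for _, label in balanced))
```

Oversampling is normally applied to the training split only, so that duplicated images do not end up in the validation data.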

Conclusion:

This method is solely my opinion and what worked for me; other methods might work for other people. My reason for sharing this is to help beginners who may be having difficulties using other web scraping tools. You can connect with me on LinkedIn, as I'm open to suggestions and recommendations. Thank you for reading.

Credit:

Deliton

Tara Boyle: Dealing with imbalanced data