Plagiarism Project :: 無人島 (Desert Island), a place to rot with me

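Preface

Many Instagram accounts post text embedded in images, usually short stories or short essays. I want to download all of them, convert the images to text, feed the text into something like GPT, and generate new articles from it.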

Scraping

This part is fairly simple. I used instagrapi.

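import instagrapi
import pickle

client = instagrapi.Client()
# Instagram may require an authenticated session first, e.g. client.login(username, password)

user = client.user_info_by_username("XXXXXX")
user_id = user.pk  # printing `user` shows the user id in its `pk` field

user_media_data = client.user_medias(user_id)

# Save with pickle so a dropped connection does not force a full reload
with open("savefile.pickle", "wb") as f:
    pickle.dump(user_media_data, f)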
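The saved list still has to be turned into image URLs before anything can be downloaded. The following is a minimal sketch of that step, not the code actually used for this project. It assumes each item behaves like an instagrapi Media object, with a `thumbnail_url` for single posts and a `resources` list (each entry with its own `thumbnail_url`) for albums. The helper names `collect_image_urls` and `load_image_urls` are made up for illustration.

```python
import pickle


def collect_image_urls(medias):
    """Return image URLs from a list of instagrapi Media-like objects."""
    urls = []
    for media in medias:
        # Assumption: albums keep their individual images in `resources`.
        resources = getattr(media, "resources", None) or []
        if resources:
            for item in resources:
                if item.thumbnail_url:
                    urls.append(str(item.thumbnail_url))
        elif media.thumbnail_url:
            # Single photo (for videos this is the cover image).
            urls.append(str(media.thumbnail_url))
    return urls


def load_image_urls(pickle_path="savefile.pickle"):
    """Load the pickled media list and extract its image URLs."""
    # Unpickling needs instagrapi installed, since the objects are its classes.
    with open(pickle_path, "rb") as f:
        return collect_image_urls(pickle.load(f))
```

Calling `load_image_urls()` after the scraping step should give a flat list of URLs to pass to the downloader.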

Downloading the images

import requests


def download_image(url, path):
    # Send a GET request; the timeout keeps a stalled connection from hanging forever
    response = requests.get(url, timeout=30)

    if response.status_code == 200:  # 200 indicates success
        with open(path, 'wb') as f:  # Open the image file in binary mode
            f.write(response.content)  # Write the response body to the image file
    else:  # Any other status code: report the failed URL
        print("Failed to download the image", url)
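A whole account means many images, so the function above needs a loop around it. Here is a minimal sketch of one way to do that. The `download_all` name, the numbered file naming, and the `.jpg` extension are assumptions for illustration. The downloader is passed in as a parameter so that `download_image` (or any other callable taking `url, path`) can be plugged in.

```python
import os


def download_all(urls, out_dir, download, ext=".jpg"):
    """Download each URL to out_dir/00000.jpg, 00001.jpg, ... and return saved paths.

    `download` is any callable taking (url, path), e.g. download_image.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, url in enumerate(urls):
        path = os.path.join(out_dir, f"{index:05d}{ext}")
        # Skip files that already exist so an interrupted run can resume.
        if not os.path.exists(path):
            download(url, path)
        # Only report files that are actually on disk (failed downloads are left out).
        if os.path.exists(path):
            saved.append(path)
    return saved
```

For example, `download_all(urls, "images", download_image)` would fill an `images` folder and can be re-run after a disconnect without fetching finished files again.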

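Image to text

This part is more troublesome. I asked GPT, and it said to use OCR.

Original response

The process you describe is called Optical Character Recognition (OCR). A commonly used OCR tool in Python is pytesseract, which is a Python interface to Tesseract-OCR. Below is a basic Python program that uses pytesseract to read the text in an image and convert it to a string:

First, you need to install the required Python package and the Tesseract OCR engine. Install pytesseract with the following command: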

pip install pytesseract

If you are on Linux or macOS, you can install Tesseract with the following commands:

Linux:

sudo apt-get install tesseract-ocr

Mac:

brew install tesseract

If you are on Windows, you need to download and install it manually from: https://github.com/UB-Mannheim/tesseract/wiki

Here is the Python program that uses pytesseract to read the text in an image:

import pytesseract
from PIL import Image

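# On Windows, either add the tesseract path to your system variables or set it explicitly like this
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def ocr_core(filename):
    """
    Convert the text in an image to a string.
    :param filename: path to the image file
    :return: the text recognized in the image
    """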
    text = pytesseract.image_to_string(Image.open(filename))
    return text

print(ocr_core('path_to_your_image'))

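Replace 'path_to_your_image' with the path to your image file. The program prints the text recognized in the image.

Note that this approach may not be fully accurate, especially for images with complex backgrounds or fonts. If you need more accurate text recognition, you may need a more advanced method, such as a machine learning model.

It does convert the images to text, but the result is indeed not very accurate.

There are a few directions from here. One possibility is that the font needs to be trained separately; see this website.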
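One likely reason for the poor accuracy is that `image_to_string` defaults to English, while the images here presumably contain Traditional Chinese. The sketch below is an untested guess at a better setup, not a verified fix. It assumes the `chi_tra` language data is installed (for example the `tesseract-ocr-chi-tra` package on Debian/Ubuntu). The grayscale, autocontrast, and 2x upscale steps are generic preprocessing choices, and `ocr_chinese` and `clean_ocr_text` are hypothetical helper names. The cleanup step assumes the OCR output contains stray spaces between Chinese characters.

```python
import re

# CJK ideographs plus common CJK punctuation and full-width forms.
_CJK = r"\u3000-\u303f\u4e00-\u9fff\uff00-\uffef"
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def clean_ocr_text(text):
    """Drop spaces between CJK characters and remove blank lines."""
    lines = []
    for line in text.splitlines():
        line = _SPACE_BETWEEN_CJK.sub("", line.strip())
        if line:
            lines.append(line)
    return "\n".join(lines)


def ocr_chinese(filename, lang="chi_tra", scale=2):
    """OCR an image as Traditional Chinese after light preprocessing."""
    # Imported here so clean_ocr_text works without the OCR stack installed.
    import pytesseract
    from PIL import Image, ImageOps

    image = ImageOps.grayscale(Image.open(filename))
    image = ImageOps.autocontrast(image)
    # Upscaling can help when the text in the image is small.
    image = image.resize((image.width * scale, image.height * scale), Image.LANCZOS)
    return clean_ocr_text(pytesseract.image_to_string(image, lang=lang))
```

`ocr_chinese('path_to_your_image')` can be used in place of `ocr_core`. Whether it actually improves the results depends on the fonts and backgrounds in the images.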