Update: 初学记、佩文韵府 and 五车韵瑞

This commit is contained in:
denglifan
2026-03-22 16:18:35 +08:00
parent df475fd03f
commit 183b842090
553 changed files with 754048 additions and 169 deletions

五车韵瑞/README.md Normal file
View File

@@ -0,0 +1,5 @@
Sources
- Shidian Guji (识典古籍): https://www.shidianguji.com/book/CADAL02059421/chapter/1lmkv1qwgsj7a?version=2
- Chinese Text Project (中国哲学电子书计划): https://ctext.org/wiki.pl?if=gb&res=87723&remap=gb
- On both sites, the 五车韵瑞 data is raw, uncorrected machine-OCR output full of garbled text, and both sites deploy aggressive anti-scraping defenses. As things stand, it is not possible to generate a high-quality JSON of the full book in a single pass from a cloud server.
- To obtain high-quality structured data for 五车韵瑞, a better route is to find high-resolution PDF facsimiles of 佩文韵府 or 五车韵瑞 (for example from the Harvard-Yenching Library or Shuge, shuge.org), re-run OCR with a modern model specialized for classical Chinese texts (such as 读史大模型 or GPT-4o Vision), and then split the result into structured JSON. Directly parsing the broken OCR on these two sites is wasted effort.
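As a hedged sketch of what such a re-OCR pipeline could target: the field names 卷 / 大韵 / 声调 / 词条内容 follow the schema used by the scraper script elsewhere in this commit, while the sample headword and entry text below are invented for illustration.

```python
import json

# Hypothetical target schema: each headword maps to a list of entries, each
# recording volume (卷), rhyme group (大韵), tone (声调), and entry text (词条内容).
# All values here are invented sample data.
entry = {
    "东": [
        {
            "卷": "卷一",
            "大韵": "一东",
            "声调": "平声",
            "词条内容": "示例词条内容",
        }
    ]
}

# ensure_ascii=False keeps the CJK characters readable in the output file.
serialized = json.dumps(entry, ensure_ascii=False, indent=2)
print(serialized)
```

Keeping one flat list of entry objects per headword (rather than nesting by volume or rhyme) makes later cleanup passes simpler to write.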

View File

@@ -0,0 +1,7 @@
<html>
<head><title>500 Internal Server Error</title></head>
<body>
<center><h1>500 Internal Server Error</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,69 @@
<?xml version="1.0" encoding="utf8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" lang="zh-CN">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf8" /><title>Access unavailable</title><link href="/text.css" rel="stylesheet" type="text/css" /><script>function m() {document.getElementById('m1').innerHTML = '<form action="https://ctext.org/requestaccess.pl" method="post" id="a" style="background-color: #FFAAAA; border: 1px solid red;">To request that this ban be removed, please <a href="#" onclick=\'document.getElementById("a").submit();\'>click here</a>.'; document.getElementById('m2').innerHTML = '<form action="https://ctext.org/requestaccess.pl?if=gb" method="post" id="b" style="background-color: #FFAAAA; border: 1px solid red;">若要申請解除封鎖,請<a href="#" onclick=\'document.getElementById("b").submit();\'>點擊此處</a>。';}; setTimeout(m, 3000);</script></head>
<body text="#000000" bgcolor="#FFFFFF" link="#0000EE" vlink="#551A8B" alink="#FF0000"><div id="menubar"></div>
<div id="content"><font size="+3">Access unavailable</font><div id="content3">
<p>Access to ctext.org is unavailable from your current location. Please note that the use of automatic download software on this website is <u><b>strictly prohibited</b></u>.</p>
<p>If you are accessing from a university, academic, or commercial network and are having difficulty accessing the site, please contact your institution (e.g. your university library staff) to arrange an institutional subscription: <a href="https://ctext.org/tools/subscribe">https://ctext.org/tools/subscribe</a>.</p>
<p>If you frequently see this error and are not using automated software to access the site, this error is most likely caused by one of the following:
<ul>
<li>Software other than a web browser running on your computer that is sending automated requests to ctext.org.</li>
<li>Malware running on your computer - e.g. your computer itself has been compromised and is now part of a botnet. ctext.org receives millions of automated requests from these every day.</li>
<li>The network you are accessing from is a frequent source of network abuse. Note that this includes almost all public VPNs and open proxies, as well as networks (including home and office networks) with poor network security that fail to prevent botnets from running on them.</li>
</ul>
</p>
<p id="m1"></p>
<h1>無法提供服務</h1><p>
很抱歉,暫時無法向您所在網絡位置提供服務。請留意,本網站<u><b>嚴禁使用自動下載軟体</b></u>下載或訪問本站的內容。</p>
<p>如果您在學校內(或機構內)網路遇到這個問題,建議您聯繫貴校(或機構)並提議他們開通<a href="https://ctext.org/tools/subscribe/zh">機構服務</a>,以免再次遇到這種問題。
</p>
<p>如果您經常看到此錯誤,且並未使用自動化軟體存取本網站,則此錯誤最有可能由以下其中一項原因造成:
<ul>
<li>在您的電腦上執行的非網頁瀏覽器軟體正在向ctext.org發送自動化請求。</li>
<li>您的電腦上正在執行惡意軟體——例如您的電腦本身已遭入侵並成為殭屍網路的一部分。ctext.org每天都會收到來自這些來源的數百萬筆自動化請求。</li>
<li>您目前使用的網路是網路濫用的常見來源。請注意這包括幾乎所有公共VPN與開放式代理伺服器以及包括家庭與辦公室網路在內網路安全性不足、未能防止殭屍網路在其中運作的網路。</li>
</ul>
<p id="m2"></p>
</div>
</div>
<div style="opacity: 0.0; font-size: 1px;">Robots and scrapers: feel free to follow <a href="https://ctext.org/hp.pl?src=403" rel="nofollow">this link</a> and include it in your next index.</div>
</body></html>

五车韵瑞/chapters.txt Normal file
View File

@@ -0,0 +1,52 @@
wiki.pl?if=gb&chapter=3941922&remap=gb
wiki.pl?if=gb&chapter=4585140&remap=gb
wiki.pl?if=gb&chapter=4551288&remap=gb
wiki.pl?if=gb&chapter=6812610&remap=gb
wiki.pl?if=gb&chapter=4522632&remap=gb
wiki.pl?if=gb&chapter=8643228&remap=gb
wiki.pl?if=gb&chapter=5680314&remap=gb
wiki.pl?if=gb&chapter=6243138&remap=gb
wiki.pl?if=gb&chapter=2404146&remap=gb
wiki.pl?if=gb&chapter=3737160&remap=gb
wiki.pl?if=gb&chapter=2116677&remap=gb
wiki.pl?if=gb&chapter=8189172&remap=gb
wiki.pl?if=gb&chapter=1394616&remap=gb
wiki.pl?if=gb&chapter=5395938&remap=gb
wiki.pl?if=gb&chapter=2237715&remap=gb
wiki.pl?if=gb&chapter=8799432&remap=gb
wiki.pl?if=gb&chapter=4342536&remap=gb
wiki.pl?if=gb&chapter=5622210&remap=gb
wiki.pl?if=gb&chapter=7624590&remap=gb
wiki.pl?if=gb&chapter=5207775&remap=gb
wiki.pl?if=gb&chapter=8297247&remap=gb
wiki.pl?if=gb&chapter=5373996&remap=gb
wiki.pl?if=gb&chapter=6219336&remap=gb
wiki.pl?if=gb&chapter=4660980&remap=gb
wiki.pl?if=gb&chapter=4535649&remap=gb
wiki.pl?if=gb&chapter=5921649&remap=gb
wiki.pl?if=gb&chapter=4265487&remap=gb
wiki.pl?if=gb&chapter=4036998&remap=gb
wiki.pl?if=gb&chapter=6256329&remap=gb
wiki.pl?if=gb&chapter=5979366&remap=gb
wiki.pl?if=gb&chapter=3344949&remap=gb
wiki.pl?if=gb&chapter=3323145&remap=gb
wiki.pl?if=gb&chapter=3164073&remap=gb
wiki.pl?if=gb&chapter=3967488&remap=gb
wiki.pl?if=gb&chapter=2342505&remap=gb
wiki.pl?if=gb&chapter=3810309&remap=gb
wiki.pl?if=gb&chapter=5925387&remap=gb
wiki.pl?if=gb&chapter=5276895&remap=gb
wiki.pl?if=gb&chapter=4809861&remap=gb
wiki.pl?if=gb&chapter=2008437&remap=gb
wiki.pl?if=gb&chapter=1616268&remap=gb
wiki.pl?if=gb&chapter=7466793&remap=gb
wiki.pl?if=gb&chapter=1173039&remap=gb
wiki.pl?if=gb&chapter=8248986&remap=gb
wiki.pl?if=gb&chapter=6989718&remap=gb
wiki.pl?if=gb&chapter=1771959&remap=gb
wiki.pl?if=gb&chapter=8579751&remap=gb
wiki.pl?if=gb&chapter=1371651&remap=gb
wiki.pl?if=gb&chapter=2792778&remap=gb
wiki.pl?if=gb&chapter=1784217&remap=gb
wiki.pl?if=gb&chapter=4993485&remap=gb
wiki.pl?if=gb&chapter=2883189&remap=gb

File diff suppressed because one or more lines are too long

View File

View File

@@ -0,0 +1,32 @@
import requests
from bs4 import BeautifulSoup

url = "https://www.shidianguji.com/book/CADAL02059421/chapter/1lmkv0n02yhom?version=2"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Check script tags for "__INIT_DATA__" or similar state hydration
scripts = soup.find_all('script')
for s in scripts:
    if s.string and ('__INIT_DATA__' in s.string or 'window.__INITIAL_STATE__' in s.string):
        print(f"Found init state data of length: {len(s.string)}")
        print(s.string[:500])

# Check normal text elements
content = soup.find_all('p')
print(f"Found {len(content)} paragraphs.")
if content:
    for p in content[:5]:
        print(p.text)

print("\n--- Let's look at another part ---")
# Try extracting text directly
text = soup.get_text()
# Find the title or some known text like "五车韵瑞"
idx = text.find("五车韵瑞")
if idx != -1:
    print(text[idx:idx+500])
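The hydration-state check above can be exercised offline. A sketch that pulls a JSON payload out of a `window.__INITIAL_STATE__` assignment in a sample HTML string; the markup here is invented, and real shidianguji pages may embed their state differently:

```python
import json
import re

# Invented sample page; real pages may assign the state another way.
html = """
<html><body>
<script>window.__INITIAL_STATE__ = {"book": {"title": "五车韵瑞"}};</script>
</body></html>
"""

# Capture the object literal between the assignment and the closing </script>.
match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});?\s*</script>", html, re.S)
if match:
    state = json.loads(match.group(1))
    print(state["book"]["title"])
```

A regex like this only works when the assigned value is a plain JSON literal; if the site wraps it in JS function calls, a proper parser (or Playwright's `page.evaluate`) is needed instead.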

View File

@@ -0,0 +1 @@
# Example script that the user could run

View File

@@ -0,0 +1,104 @@
import asyncio
import json
import re

from playwright.async_api import async_playwright

CTEXT_INDEX_URL = "https://ctext.org/wiki.pl?if=gb&res=87723&remap=gb"


async def scrape_ctext():
    results = {}
    # Launch Playwright; headed mode (headless=False) is required here to get past anti-bot checks
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800}
        )
        page = await context.new_page()
        print(f"Visiting index page: {CTEXT_INDEX_URL}")
        await page.goto(CTEXT_INDEX_URL, wait_until="domcontentloaded")
        # Pause to let any Cloudflare challenge clear
        await page.wait_for_timeout(5000)

        # Collect all chapter links
        links = await page.locator("a[href*='wiki.pl?if=gb&chapter=']").all()
        chapter_urls = []
        for link in links:
            href = await link.get_attribute("href")
            if href:
                full_url = "https://ctext.org/" + href
                if full_url not in chapter_urls:
                    chapter_urls.append(full_url)
        print(f"Found {len(chapter_urls)} chapter links.")

        # Walk the first couple of chapters as a sample
        for url in chapter_urls[:2]:
            print(f"Scraping chapter: {url}")
            await page.goto(url, wait_until="domcontentloaded")
            await page.wait_for_timeout(3000)
            # Pull the text area
            text_content = await page.evaluate("""() => {
                const tds = document.querySelectorAll('td.ctext');
                let text = '';
                tds.forEach(td => { text += td.innerText + '\\n'; });
                return text;
            }""")
            print("--- First 100 characters of scraped text ---")
            print(text_content[:100])
            print("--------------------------------")

            # Parsing logic (based on an assumed text structure)
            current_volume = "未知卷"
            current_rhyme = "未知韵"
            current_tone = "平声"
            for line in text_content.split('\n'):
                line = line.strip()
                if not line:
                    continue
                # Try to spot volume and tone headings. The marker characters "卷" and "声"
                # were lost in the source and are reconstructed guesses; since the CText wiki
                # text is uncorrected OCR, these checks will rarely match cleanly anyway.
                if line.startswith("卷"):
                    current_volume = line
                    continue
                if "声" in line and len(line) < 5:
                    current_tone = line
                    continue
                # Try to separate headword and body (assume the headword leads the line, < 4 chars)
                parts = re.split(r'[:\s]', line, maxsplit=1)
                if len(parts) == 2:
                    word, content = parts
                else:
                    word = line[0:2] if len(line) > 2 else line
                    content = line[2:]
                # Build the dict in the requested format
                # (the "卷" key is likewise a reconstructed guess; the source line was blank)
                if word not in results:
                    results[word] = []
                results[word].append({
                    "卷": current_volume,
                    "大韵": current_rhyme,
                    "声调": current_tone,
                    "词条内容": content
                })
        await browser.close()

    # Save as JSON
    with open("五车韵瑞_示例.json", "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    print("Extraction done. The source itself contains heavy OCR garbage, so the JSON will likely need extensive manual cleanup.")


if __name__ == "__main__":
    asyncio.run(scrape_ctext())
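The headword/definition split used in the parsing loop above can be isolated and checked on clean sample lines. The sample strings here are invented; real CText wiki text is far messier:

```python
import re

def split_entry(line: str):
    """Split a line into (headword, definition) the way the scraper does:
    first on a fullwidth colon or whitespace, falling back to treating the
    first two characters as the headword."""
    line = line.strip()
    parts = re.split(r'[:\s]', line, maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return (line[0:2], line[2:]) if len(line) > 2 else (line, "")

# Invented sample lines for illustration.
print(split_entry("东:东风 春天之风也"))
print(split_entry("东风解冻"))
```

The two-character fallback is a crude heuristic for headwords without a delimiter; on garbled OCR lines it will often cut in the wrong place, which is one reason the resulting JSON needs manual review.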

View File

@@ -0,0 +1,29 @@
import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # Use a convincing user agent
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            viewport={"width": 1920, "height": 1080},
            java_script_enabled=True
        )
        page = await context.new_page()
        print("Fetching CText...")
        try:
            await page.goto("https://ctext.org/wiki.pl?if=gb&res=87723&remap=gb", timeout=30000)
            await page.wait_for_timeout(3000)  # wait a bit for CF or similar
            title = await page.title()
            print(f"CText Title: {title}")
            # extract some text
            content = await page.evaluate("() => document.body.innerText")
            print(f"CText Content preview:\n{content[:500]}")
        except Exception as e:
            print(f"CText Playwright Error: {e}")
        await browser.close()


asyncio.run(main())

View File

@@ -0,0 +1,27 @@
import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Intercept network responses and dump interesting API payloads
        async def log_response(response):
            try:
                if 'api/guji' in response.url or 'chapter' in response.url:
                    text = await response.text()
                    print(f"URL: {response.url}\nData: {text[:500]}\n")
            except Exception:
                # Some responses (redirects, aborted requests) have no readable body
                pass

        page.on("response", log_response)
        print("Navigating to shidianguji...")
        await page.goto("https://www.shidianguji.com/book/CADAL02059421/chapter/1lmkv0n02yhom?version=2", wait_until="networkidle")
        await page.wait_for_timeout(3000)
        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())