Update: 初学记、佩文韵府 and 五车韵瑞

This commit is contained in:
denglifan
2026-03-22 16:18:35 +08:00
parent df475fd03f
commit 183b842090
553 changed files with 754048 additions and 169 deletions

五车韵瑞/README.md Normal file
View File

@@ -0,0 +1,5 @@
Sources
- Shidian Guji (识典古籍): https://www.shidianguji.com/book/CADAL02059421/chapter/1lmkv1qwgsj7a?version=2
- Chinese Text Project (中国哲学电子书计划): https://ctext.org/wiki.pl?if=gb&res=87723&remap=gb
- On both sites, the 五车韵瑞 data is raw, uncorrected machine-OCR output full of garbled text, and both sites deploy aggressive anti-scraping defenses. As things stand, it is not possible to generate a high-quality JSON of the full book in a single pass from a cloud server.
- To obtain high-quality structured data for 五车韵瑞, a better route is to find high-resolution PDF facsimiles of 佩文韵府 or 五车韵瑞 (for example from the Harvard-Yenching Library or Shuge, shuge.org), re-run OCR with a modern model specialized for classical Chinese texts (such as 读史大模型 or GPT-4o Vision), and then split the result into structured JSON. Directly parsing the broken OCR on these two sites is wasted effort.
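As a hedged sketch of what such a re-OCR pipeline could target: the field names 卷 / 大韵 / 声调 / 词条内容 follow the schema used by the scraper script elsewhere in this commit, while the sample headword and entry text below are invented for illustration.

```python
import json

# Hypothetical target schema: each headword maps to a list of entries, each
# recording volume (卷), rhyme group (大韵), tone (声调), and entry text (词条内容).
# All values here are invented sample data.
entry = {
    "东": [
        {
            "卷": "卷一",
            "大韵": "一东",
            "声调": "平声",
            "词条内容": "示例词条内容",
        }
    ]
}

# ensure_ascii=False keeps the CJK characters readable in the output file.
serialized = json.dumps(entry, ensure_ascii=False, indent=2)
print(serialized)
```

Keeping one flat list of entry objects per headword (rather than nesting by volume or rhyme) makes later cleanup passes simpler to write.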

View File

@@ -0,0 +1,7 @@
<html>
<head><title>500 Internal Server Error</title></head>
<body>
<center><h1>500 Internal Server Error</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,69 @@
<?xml version="1.0" encoding="utf8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh-CN" lang="zh-CN">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf8" /><title>Access unavailable</title><link href="/text.css" rel="stylesheet" type="text/css" /><script>function m() {document.getElementById('m1').innerHTML = '<form action="https://ctext.org/requestaccess.pl" method="post" id="a" style="background-color: #FFAAAA; border: 1px solid red;">To request that this ban be removed, please <a href="#" onclick=\'document.getElementById("a").submit();\'>click here</a>.'; document.getElementById('m2').innerHTML = '<form action="https://ctext.org/requestaccess.pl?if=gb" method="post" id="b" style="background-color: #FFAAAA; border: 1px solid red;">若要申請解除封鎖,請<a href="#" onclick=\'document.getElementById("b").submit();\'>點擊此處</a>。';}; setTimeout(m, 3000);</script></head>
<body text="#000000" bgcolor="#FFFFFF" link="#0000EE" vlink="#551A8B" alink="#FF0000"><div id="menubar"></div>
<div id="content"><font size="+3">Access unavailable</font><div id="content3">
<p>Access to ctext.org is unavailable from your current location. Please note that the use of automatic download software on this website is <u><b>strictly prohibited</b></u>.</p>
<p>If you are accessing from a university, academic, or commercial network and are having difficulty accessing the site, please contact your institution (e.g. your university library staff) to arrange an institutional subscription: <a href="https://ctext.org/tools/subscribe">https://ctext.org/tools/subscribe</a>.</p>
<p>If you frequently see this error and are not using automated software to access the site, this error is most likely caused by one of the following:
<ul>
<li>Software other than a web browser running on your computer that is sending automated requests to ctext.org.</li>
<li>Malware running on your computer - e.g. your computer itself has been compromised and is now part of a botnet. ctext.org receives millions of automated requests from these every day.</li>
<li>The network you are accessing from is a frequent source of network abuse. Note that this includes almost all public VPNs and open proxies, as well as networks (including home and office networks) with poor network security that fail to prevent botnets from running on them.</li>
</ul>
</p>
<p id="m1"></p>
<h1>無法提供服務</h1><p>
很抱歉,暫時無法向您所在網絡位置提供服務。請留意,本網站<u><b>嚴禁使用自動下載軟体</b></u>下載或訪問本站的內容。</p>
<p>如果您在學校內(或機構內)網路遇到這個問題,建議您聯繫貴校(或機構)並提議他們開通<a href="https://ctext.org/tools/subscribe/zh">機構服務</a>,以免再次遇到這種問題。
</p>
<p>如果您經常看到此錯誤,且並未使用自動化軟體存取本網站,則此錯誤最有可能由以下其中一項原因造成:
<ul>
<li>在您的電腦上執行的非網頁瀏覽器軟體正在向ctext.org發送自動化請求。</li>
<li>您的電腦上正在執行惡意軟體——例如您的電腦本身已遭入侵並成為殭屍網路的一部分。ctext.org每天都會收到來自這些來源的數百萬筆自動化請求。</li>
<li>您目前使用的網路是網路濫用的常見來源。請注意這包括幾乎所有公共VPN與開放式代理伺服器以及包括家庭與辦公室網路在內網路安全性不足、未能防止殭屍網路在其中運作的網路。</li>
</ul>
<p id="m2"></p>
</div>
</div>
<div style="opacity: 0.0; font-size: 1px;">Robots and scrapers: feel free to follow <a href="https://ctext.org/hp.pl?src=403" rel="nofollow">this link</a> and include it in your next index.</div>
</body></html>

五车韵瑞/chapters.txt Normal file
View File

@@ -0,0 +1,52 @@
wiki.pl?if=gb&chapter=3941922&remap=gb
wiki.pl?if=gb&chapter=4585140&remap=gb
wiki.pl?if=gb&chapter=4551288&remap=gb
wiki.pl?if=gb&chapter=6812610&remap=gb
wiki.pl?if=gb&chapter=4522632&remap=gb
wiki.pl?if=gb&chapter=8643228&remap=gb
wiki.pl?if=gb&chapter=5680314&remap=gb
wiki.pl?if=gb&chapter=6243138&remap=gb
wiki.pl?if=gb&chapter=2404146&remap=gb
wiki.pl?if=gb&chapter=3737160&remap=gb
wiki.pl?if=gb&chapter=2116677&remap=gb
wiki.pl?if=gb&chapter=8189172&remap=gb
wiki.pl?if=gb&chapter=1394616&remap=gb
wiki.pl?if=gb&chapter=5395938&remap=gb
wiki.pl?if=gb&chapter=2237715&remap=gb
wiki.pl?if=gb&chapter=8799432&remap=gb
wiki.pl?if=gb&chapter=4342536&remap=gb
wiki.pl?if=gb&chapter=5622210&remap=gb
wiki.pl?if=gb&chapter=7624590&remap=gb
wiki.pl?if=gb&chapter=5207775&remap=gb
wiki.pl?if=gb&chapter=8297247&remap=gb
wiki.pl?if=gb&chapter=5373996&remap=gb
wiki.pl?if=gb&chapter=6219336&remap=gb
wiki.pl?if=gb&chapter=4660980&remap=gb
wiki.pl?if=gb&chapter=4535649&remap=gb
wiki.pl?if=gb&chapter=5921649&remap=gb
wiki.pl?if=gb&chapter=4265487&remap=gb
wiki.pl?if=gb&chapter=4036998&remap=gb
wiki.pl?if=gb&chapter=6256329&remap=gb
wiki.pl?if=gb&chapter=5979366&remap=gb
wiki.pl?if=gb&chapter=3344949&remap=gb
wiki.pl?if=gb&chapter=3323145&remap=gb
wiki.pl?if=gb&chapter=3164073&remap=gb
wiki.pl?if=gb&chapter=3967488&remap=gb
wiki.pl?if=gb&chapter=2342505&remap=gb
wiki.pl?if=gb&chapter=3810309&remap=gb
wiki.pl?if=gb&chapter=5925387&remap=gb
wiki.pl?if=gb&chapter=5276895&remap=gb
wiki.pl?if=gb&chapter=4809861&remap=gb
wiki.pl?if=gb&chapter=2008437&remap=gb
wiki.pl?if=gb&chapter=1616268&remap=gb
wiki.pl?if=gb&chapter=7466793&remap=gb
wiki.pl?if=gb&chapter=1173039&remap=gb
wiki.pl?if=gb&chapter=8248986&remap=gb
wiki.pl?if=gb&chapter=6989718&remap=gb
wiki.pl?if=gb&chapter=1771959&remap=gb
wiki.pl?if=gb&chapter=8579751&remap=gb
wiki.pl?if=gb&chapter=1371651&remap=gb
wiki.pl?if=gb&chapter=2792778&remap=gb
wiki.pl?if=gb&chapter=1784217&remap=gb
wiki.pl?if=gb&chapter=4993485&remap=gb
wiki.pl?if=gb&chapter=2883189&remap=gb

File diff suppressed because one or more lines are too long

View File

View File

@@ -0,0 +1,32 @@
import requests
from bs4 import BeautifulSoup

url = "https://www.shidianguji.com/book/CADAL02059421/chapter/1lmkv0n02yhom?version=2"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# Check script tags for "__INIT_DATA__" or similar state hydration
scripts = soup.find_all('script')
for s in scripts:
    if s.string and ('__INIT_DATA__' in s.string or 'window.__INITIAL_STATE__' in s.string):
        print(f"Found init state data of length: {len(s.string)}")
        print(s.string[:500])

# Check normal text elements
content = soup.find_all('p')
print(f"Found {len(content)} paragraphs.")
if content:
    for p in content[:5]:
        print(p.text)

print("\n--- Let's look at another part ---")
# Try extracting text directly
text = soup.get_text()
# Find the title or some known text like "五车韵瑞"
idx = text.find("五车韵瑞")
if idx != -1:
    print(text[idx:idx+500])
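The hydration-state check above can be exercised offline. A sketch that pulls a JSON payload out of a `window.__INITIAL_STATE__` assignment in a sample HTML string; the markup here is invented, and real shidianguji pages may embed their state differently:

```python
import json
import re

# Invented sample page; real pages may assign the state another way.
html = """
<html><body>
<script>window.__INITIAL_STATE__ = {"book": {"title": "五车韵瑞"}};</script>
</body></html>
"""

# Capture the object literal between the assignment and the closing </script>.
match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\});?\s*</script>", html, re.S)
if match:
    state = json.loads(match.group(1))
    print(state["book"]["title"])
```

A regex like this only works when the assigned value is a plain JSON literal; if the site wraps it in JS function calls, a proper parser (or Playwright's `page.evaluate`) is needed instead.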

View File

@@ -0,0 +1 @@
# Example script that the user could run

View File

@@ -0,0 +1,104 @@
import asyncio
import json
import re

from playwright.async_api import async_playwright

CTEXT_INDEX_URL = "https://ctext.org/wiki.pl?if=gb&res=87723&remap=gb"


async def scrape_ctext():
    results = {}
    # Launch Playwright; headed mode (headless=False) is required here to get past anti-bot checks
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            viewport={"width": 1280, "height": 800}
        )
        page = await context.new_page()
        print(f"Visiting index page: {CTEXT_INDEX_URL}")
        await page.goto(CTEXT_INDEX_URL, wait_until="domcontentloaded")
        # Pause to let any Cloudflare challenge clear
        await page.wait_for_timeout(5000)

        # Collect all chapter links
        links = await page.locator("a[href*='wiki.pl?if=gb&chapter=']").all()
        chapter_urls = []
        for link in links:
            href = await link.get_attribute("href")
            if href:
                full_url = "https://ctext.org/" + href
                if full_url not in chapter_urls:
                    chapter_urls.append(full_url)
        print(f"Found {len(chapter_urls)} chapter links.")

        # Walk the first couple of chapters as a sample
        for url in chapter_urls[:2]:
            print(f"Scraping chapter: {url}")
            await page.goto(url, wait_until="domcontentloaded")
            await page.wait_for_timeout(3000)
            # Pull the text area
            text_content = await page.evaluate("""() => {
                const tds = document.querySelectorAll('td.ctext');
                let text = '';
                tds.forEach(td => { text += td.innerText + '\\n'; });
                return text;
            }""")
            print("--- First 100 characters of scraped text ---")
            print(text_content[:100])
            print("--------------------------------")

            # Parsing logic (based on an assumed text structure)
            current_volume = "未知卷"
            current_rhyme = "未知韵"
            current_tone = "平声"
            for line in text_content.split('\n'):
                line = line.strip()
                if not line:
                    continue
                # Try to spot volume and tone headings. The marker characters "卷" and "声"
                # were lost in the source and are reconstructed guesses; since the CText wiki
                # text is uncorrected OCR, these checks will rarely match cleanly anyway.
                if line.startswith("卷"):
                    current_volume = line
                    continue
                if "声" in line and len(line) < 5:
                    current_tone = line
                    continue
                # Try to separate headword and body (assume the headword leads the line, < 4 chars)
                parts = re.split(r'[:\s]', line, maxsplit=1)
                if len(parts) == 2:
                    word, content = parts
                else:
                    word = line[0:2] if len(line) > 2 else line
                    content = line[2:]
                # Build the dict in the requested format
                # (the "卷" key is likewise a reconstructed guess; the source line was blank)
                if word not in results:
                    results[word] = []
                results[word].append({
                    "卷": current_volume,
                    "大韵": current_rhyme,
                    "声调": current_tone,
                    "词条内容": content
                })
        await browser.close()

    # Save as JSON
    with open("五车韵瑞_示例.json", "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    print("Extraction done. The source itself contains heavy OCR garbage, so the JSON will likely need extensive manual cleanup.")


if __name__ == "__main__":
    asyncio.run(scrape_ctext())
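The headword/definition split used in the parsing loop above can be isolated and checked on clean sample lines. The sample strings here are invented; real CText wiki text is far messier:

```python
import re

def split_entry(line: str):
    """Split a line into (headword, definition) the way the scraper does:
    first on a fullwidth colon or whitespace, falling back to treating the
    first two characters as the headword."""
    line = line.strip()
    parts = re.split(r'[:\s]', line, maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return (line[0:2], line[2:]) if len(line) > 2 else (line, "")

# Invented sample lines for illustration.
print(split_entry("东:东风 春天之风也"))
print(split_entry("东风解冻"))
```

The two-character fallback is a crude heuristic for headwords without a delimiter; on garbled OCR lines it will often cut in the wrong place, which is one reason the resulting JSON needs manual review.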

View File

@@ -0,0 +1,29 @@
import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        # Use a convincing user agent
        context = await browser.new_context(
            user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
            viewport={"width": 1920, "height": 1080},
            java_script_enabled=True
        )
        page = await context.new_page()
        print("Fetching CText...")
        try:
            await page.goto("https://ctext.org/wiki.pl?if=gb&res=87723&remap=gb", timeout=30000)
            await page.wait_for_timeout(3000)  # wait a bit for CF or similar
            title = await page.title()
            print(f"CText Title: {title}")
            # extract some text
            content = await page.evaluate("() => document.body.innerText")
            print(f"CText Content preview:\n{content[:500]}")
        except Exception as e:
            print(f"CText Playwright Error: {e}")
        await browser.close()


asyncio.run(main())

View File

@@ -0,0 +1,27 @@
import asyncio

from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        # Intercept network responses and dump interesting API payloads
        async def log_response(response):
            try:
                if 'api/guji' in response.url or 'chapter' in response.url:
                    text = await response.text()
                    print(f"URL: {response.url}\nData: {text[:500]}\n")
            except Exception:
                # Some responses (redirects, aborted requests) have no readable body
                pass

        page.on("response", log_response)
        print("Navigating to shidianguji...")
        await page.goto("https://www.shidianguji.com/book/CADAL02059421/chapter/1lmkv0n02yhom?version=2", wait_until="networkidle")
        await page.wait_for_timeout(3000)
        await browser.close()


if __name__ == "__main__":
    asyncio.run(main())