为了爬虫速成基础知识的笔记

前端基础知识

html

HTML 元素

开始标签	元素内容	结束标签
<p>	这是一个段落	</p>
	这是一个链接	</a>
	换行

如<br>是空元素

HTML 链接语法

<a href="url">链接文本</a>

href 属性描述了链接的目标。.

HTML 属性

属性	描述
class	为html元素定义一个或多个类名（classname）(类名从样式文件引入)
id	定义元素的唯一id
style	规定元素的行内样式（inline style）
title	描述了元素的额外信息 (作为工具条使用)

html插入CSS

CSS 是在 HTML 4 开始使用的,是为了更好的渲染HTML元素而引入的.

CSS 可以通过以下方式添加到HTML中:

内联样式- 在HTML元素中使用”style” 属性
内部样式表 -在HTML文档头部 <head> 区域使用
外部引用 - 使用外部 CSS 文件

最好的方式是通过外部引用CSS文件.

css样例：一个按钮

https://codepen.io/FelipeMarcos/pen/tfhEg

内部（一般用于单个特别）

<head>
<style type="text/css">
body {background-color:yellow;}
p {color:blue;}
</style>
</head>

或

$('.title').css({ //添加样式，固定住Tab
                    'top': 0,
                    'left': 0,
                    'width': '100%',
                    'z-index': 999,
                    'position': 'fixed'
                });

(利用js插入。

此处用了jquery

js

语法等略此处记录下dom的学习

dom

寻找元素:

**通过ID寻找**：var a=document.getElementById("xiaoming")

同样的有getElenmentByTag和GetElementByClass

向指定元素添加事件句柄:addEventListener()

document.getElementById("myBtn").addEventListener("click", displayDate);
function displayDate() {
    document.getElementById("demo").innerHTML = Date();
}

removeEventListener() 方法移除由 addEventListener() 方法添加的事件句柄

参考手册：runoob.com/jsref/dom-obj-event.html

jquery

是js的一个库。用更少的代码做更多的事情。

下面是个用jquery调CSS的例子

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<script src="https://cdn.staticfile.org/jquery/1.10.2/jquery.min.js">
</script>
<script>
$(document).ready(function(){
  $("button").click(function(){
    $("h1,h2,p").addClass("blue");
    $("div").addClass("important");
  });
});
</script>
<style type="text/css">
.important
{
	font-weight:bold;
	font-size:xx-large;
}
.blue
{
	color:blue;
}
</style>
</head>
<body>

<h1>标题 1</h1>
<h2>标题 2</h2>
<p>这是一个段落。</p>
<p>这是另外一个段落。</p>
<div>这是一些重要的文本!</div>
<br>
<button>为元素添加 class</button>

</body>
</html>

Xml

XML 不是 HTML 的替代。

XML 和 HTML 为不同的目的而设计：

XML 被设计用来传输和存储数据，其焦点是数据的内容。
HTML 被设计用来显示数据，其焦点是数据的外观。

HTML 旨在显示信息，而 XML 旨在传输信息。

XML是一种树结构

<bookstore>
    <book category="COOKING">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
    </book>
    <book category="CHILDREN">
        <title lang="en">Harry Potter</title>
        <author>J K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
    </book>
    <book category="WEB">
        <title lang="en">Learning XML</title>
        <author>Erik T. Ray</author>
        <year>2003</year>
        <price>39.95</price>
    </book>
</bookstore>

实例中的根元素是。文档中的所有元素都被包含在中。

元素有 4 个子元素：、<author>、<year>、<price> 语法详见https://www.runoob.com/xml/xml-syntax.html ## Xpath Xpath在爬数据过程中很重要。 ### xpath简介 > 1. xpath 用于在XML文档中通过元素和属性进行导航 > 2. xpath 使用路径表达式来选取 XML 或 HTML 中的节点或者节点集。 ### xpath 节点 > 1. 在 xpath 中有七种类型的节点： > > > - 元素 > > - 属性 > > - 文本 > > - 命名空间 > > - 处理指令 > > - 注释 > > - 文档根节点 > > 1. 节点关系 > > > - 父级 > > - 子级 > > - 同辈级 > > - 先辈级 > > - 后代级 ### xpath 语法 | 表达式 | 描述 | 语法示例 | | ------------ | -------------------------------------------------------- | ------------------------------------------------------------ | | nodename | 选取此节点的所有子节点,用户选取标签名 | book | | / | 从根节点选取 | /book | | // | 从匹配选择的当前节点选择文档中的节点，而不考虑他们的位置 | //price 选择任意位置的 price 标签 | | . | 选取当前节点 | -- | | .. | 选取当前节点的父节点 | -- | | @ | 选取属性 | //@lang 选择lang标签的所有属性 | | [n] | 选取属于某元素下的第n个子元素,索引从1开始 | /bookstore/book[2] | | [last()] | 选取属于某个元素的最后一个元素 | /book[last()] last() 当作函数理解 | | [last()-1] | 选取属于某个元素的最后一个元素 | /book[last()-1] | | [position()] | 做元素定位使用 | /book[position()<3 选择最前面两个的book元素] | | [@lang] | 选择有属性为lang的元素 | //book[@lang] 选择有lang属性的book标签 | | [@class=""] | 选择class属性为某个值的元素 | //div[@class="item"] 选择类名为item的div标签 | | * | 匹配任何元素节点 | -- | | @* | 匹配任何属性节点 | -- | | \| | 表示"或" | //bookstore/book/title\|//bookstore/book/price 表示匹配文章中 bookstore 下 book 标签下的 title 或这 price 标签 | ### 语法示例以下为python脚本 ```python from lxml import etree html = ''' <bookstore> <book price="100" category="cooking"> <title lang="en">Everyday Italian Giada De Laurentiis 2005 30.00 Harry Potter J K. Rowling 2005 29.99 XQuery Kick Start James McGovern Per Bothner Kurt Cagle James Linn Vaidyanathan Nagarajan 2003 49.99 Learning XML Erik T. Ray 2003 39.95 ''' # 将字符串转为html对象 html = etree.HTML(html) # 获取bookstore下book标签的title标签 res = html.xpath('//bookstore/book/title') print(res) # 返回值（列表中是对象）：[<Element title at 0x25c54df7388>, <Element title at 0x25c54df73c8>, <Element title at 0x25c54df7448>, <Element title at 0x25c54df7488>] # 获取bookstore下book标签的title标签中category属性值为“web” res = html.xpath('//bookstore/book/title[@category="web"]') print(res) # 返回值（列表中是对象）：[<Element title at 0x1f0ec727388>] # 获取bookstore下book标签的title标签中category属性值为“web”的标签的值 res = html.xpath('//bookstore/book/title[@category="web"]/text()') print(res) # 返回值（列表中是字符串）：['XQuery Kick Start'] # 获取bookstore下book的price属性为100的标签下的title()标签的值 res = html.xpath('//bookstore/book[@price="100"]/title/text()') print(res) # 返回值（列表中的是字符串）：['Everyday Italian', 'Learning XML'] # 获取bookstore 下book下第1个标签的所有属性 [1]:表示选择第几个标签 @*：表示选择所有的属性 res = html.xpath('//bookstore/book[1]/@*') print(res) # 返回值（列表中是字符串） # 获取bookstore 下book带有任意属性的标签 res = html.xpath('//bookstore/book[@*]') print(res) # 返回值 (列表中的是对象)：[<Element book at 0x17a7fb47388>, <Element book at 0x17a7fb473c8>, <Element book at 0x17a7fb47448>, <Element book at 0x17a7fb47488>] # 获取带有属性的title元素的值 res = html.xpath('//title[@*]/text()') print(res) # 返回值（列表中的是字符串）：['Everyday Italian', 'Harry Potter', 'XQuery Kick Start'] # 获取book下的 title 和 price 标签 res = html.xpath('//book/title | //book/price') print(res) # 返回值：（列表中的是对象）：[<Element title at 0x256dc8773c8>, <Element price at 0x256dc877448>, <Element title at 0x256dc877488>, <Element price at 0x256dc8774c8>, <Element title at 0x256dc877508>, <Element price at 0x256dc877548>, <Element title at 0x256dc877588>, <Element price at 0x256dc8775c8>] # 获取book有category或price属性的值 res = html.xpath('//book/@category | //book/@price') print(res) # 返回值（列表中的是字符串）：['100', 'cooking', 'children', 'web', 'web', '100'] ```