Installing and Configuring Spark

Introduction

Apache Spark is a powerful distributed computing framework widely used for big data analytics, real-time stream processing, machine learning, and more. For developers and data engineers, installing and configuring a Spark environment correctly is the first step toward using Spark. This post covers how to install Spark on different operating systems (Linux, Windows, macOS, Ubuntu), configure the Spark environment variables, set up a Spark cluster, choose between local mode and cluster mode, and install and configure Hadoop when HDFS is involved. It should help you build a working Spark environment from scratch and run your first Spark application.


1. Overview of Spark Installation and Configuration

Before you start installing, it helps to understand two key concepts:

  • Local mode: for a single machine; all of Spark's computation runs in one process. Typically used for development and testing.
  • Cluster mode: for a distributed environment; Spark distributes tasks across the nodes of a cluster, which suits large-scale data processing. (A short code sketch contrasting the two modes follows this list.)
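
To make the difference concrete, here is a minimal PySpark sketch (the application name and numbers are only illustrative): in local mode the master URL is local[*], while in cluster mode the master is usually supplied at submission time, so the application code itself stays the same.

from pyspark.sql import SparkSession

# Local mode: everything runs in a single process on this machine.
spark = SparkSession.builder \
    .appName("local-mode-demo") \
    .master("local[*]") \
    .getOrCreate()

print(spark.range(100).count())  # quick sanity check, prints 100

# In cluster mode you would normally omit .master(...) here and pass it
# at submission time, e.g. spark-submit --master spark://<MasterNode>:7077 app.py
spark.stop()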

Spark can be installed on different operating systems. The sections below cover the installation steps on Linux, Windows, macOS, and Ubuntu.


2. Installing and Configuring Spark on Linux

2.1 Installing Spark

The steps to install Spark on Linux are as follows:

  1. Install Java

Spark runs on the JVM (its core is written in Scala), so you need to install Java first (JDK 8 or later is recommended). You can install it with the following commands:

sudo apt update
sudo apt install openjdk-8-jdk
  2. Download and install Spark

Visit the Apache Spark download page at https://spark.apache.org/downloads.html and choose a release and a matching pre-built Hadoop package to download (the commands below use 2.4.7 as an example).

# Download Spark (this example uses version 2.4.7)
wget https://archive.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz

# Extract the downloaded archive
tar -xvf spark-2.4.7-bin-hadoop2.7.tgz

# Move it to a suitable directory
sudo mv spark-2.4.7-bin-hadoop2.7 /opt/spark
  3. Configure Spark environment variables

Edit the ~/.bashrc file to configure the Spark environment variables:

# Open the .bashrc file
nano ~/.bashrc

# Add the following lines
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PATH=$SPARK_HOME/sbin:$PATH

After saving and exiting, run the following command to make the environment variables take effect:

source ~/.bashrc
  4. Start Spark

You can start Spark in the following ways:

  • Local mode: start Spark locally with the interactive shell
$ spark-shell
  • Cluster mode: if you have already set up a Spark cluster, submit an application with the following command (a minimal example application is sketched after this list):
$ spark-submit --master spark://<MasterNode>:7077 <your_spark_app.py>
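
For reference, here is a minimal sketch of what <your_spark_app.py> might contain (the word-count logic and names are illustrative assumptions, not part of the Spark distribution):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the master is supplied by spark-submit.
spark = SparkSession.builder.appName("word-count-demo").getOrCreate()

# Count words in a small in-memory list so the job has no external dependencies.
words = spark.sparkContext.parallelize(["spark", "hadoop", "spark", "hdfs"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

for word, count in counts.collect():
    print(word, count)

spark.stop()

The same file can also be run locally with spark-submit --master local[*] before trying it on a cluster.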

2.2 Configuring a Spark Cluster (Master and Workers)

A Spark cluster consists of a Master node and one or more Worker nodes. You need to configure both the Master and the Workers.

  1. Configure the Master node

Edit the $SPARK_HOME/conf/spark-env.sh file and set the Master node's host and port:

export SPARK_MASTER_HOST=<master_node_host>
export SPARK_MASTER_PORT=7077
  2. Configure the Worker nodes

On each Worker node, start a Spark Worker and connect it to the Master node:

$ spark-class org.apache.spark.deploy.worker.Worker spark://<master_host>:7077
  3. Start the Spark cluster

On the Master node, start the Spark Master:

$ start-master.sh

On each Worker node, start the Spark Worker:

$ start-slave.sh spark://<master_host>:7077
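
Once the Master and Workers are up, you can confirm that the Workers have registered by opening the Master web UI (by default at http://<master_host>:8080). As a quick smoke test from Python (a minimal sketch, assuming pyspark is installed and the Master is reachable):

from pyspark.sql import SparkSession

# Connect to the standalone cluster instead of running locally.
spark = SparkSession.builder \
    .appName("cluster-smoke-test") \
    .master("spark://<master_host>:7077") \
    .getOrCreate()

# A trivial distributed job; if it returns a result, the cluster is working.
print(spark.sparkContext.parallelize(range(1000)).sum())

spark.stop()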

3. Installing and Configuring Spark on Windows

3.1 Installing Spark

  1. Install Java

Windows users need to install Java and configure the JAVA_HOME environment variable:

  • Download and install the JDK from the JDK download page.
  • Configure the environment variables: point JAVA_HOME at the Java installation directory and add it to the Path variable.
  2. Download and extract Spark

Download the pre-built package from the Spark website and extract it to a directory (the wget and tar commands below assume a Unix-like shell such as Git Bash; you can also download the archive in a browser and extract it with an archive tool):

# Download and extract
wget https://archive.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
tar -xvf spark-2.4.7-bin-hadoop2.7.tgz
  3. Configure Spark environment variables

Open System Properties -> Advanced -> Environment Variables and set the following environment variables:

  • SPARK_HOME: the Spark installation directory
  • Path: add %SPARK_HOME%\bin to the Path variable
  4. Start Spark

Windows users can start the interactive Spark shell with bin\spark-shell.cmd, or submit Spark jobs with bin\spark-submit.cmd.


4. Installing and Configuring Spark on macOS

4.1 Installing Spark

  1. Install Homebrew

Homebrew is a package manager for macOS. If you do not have it installed, install it with the following command:

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
  2. Install Spark with Homebrew

Run the following command in a terminal:

brew install apache-spark
  3. Configure environment variables

Edit ~/.bash_profile (or ~/.zshrc, depending on the shell you use) and add the Spark environment variables (on Apple Silicon Macs the Homebrew prefix is /opt/homebrew rather than /usr/local, so adjust the path accordingly):

export SPARK_HOME=/usr/local/opt/apache-spark/libexec
export PATH=$SPARK_HOME/bin:$PATH
export PATH=$SPARK_HOME/sbin:$PATH
  4. Start Spark

Start Spark with the following command:

spark-shell
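
If you prefer Python, the same distribution also ships the pyspark shell; inside it a SparkSession named spark is already created, so you can run something like:

# Inside the pyspark shell, `spark` is predefined.
spark.range(10).show()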

5. Installing and Configuring Spark on Ubuntu

5.1 Installing Spark

The steps on Ubuntu are essentially the same as on other Linux distributions: install the dependencies with the APT package manager, then download and install Spark manually.

  1. Install Java

Install OpenJDK with the following commands:

sudo apt update
sudo apt install openjdk-8-jdk
  2. Download and install Spark

Download the Spark archive and extract it:

wget https://archive.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
tar -xvf spark-2.4.7-bin-hadoop2.7.tgz
sudo mv spark-2.4.7-bin-hadoop2.7 /opt/spark
  3. Configure environment variables

Edit ~/.bashrc and add the Spark environment variables:

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
export PATH=$SPARK_HOME/sbin:$PATH

Run source ~/.bashrc to make them take effect.

  4. Start Spark

Start Spark with the following command:

spark-shell

6. Installing and Configuring Hadoop (if HDFS Is Needed)

If you want to use HDFS as the storage system for Spark's data, you also need to install and configure Hadoop.

  1. Download and install Hadoop

Download Hadoop from the official site and extract it:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.0/hadoop-3.3.0.tar.gz
tar -xvf hadoop-3.3.0.tar.gz
sudo mv hadoop-3.3.0 /opt/hadoop
  2. Configure Hadoop environment variables

Edit ~/.bashrc and set the Hadoop environment variables (the HDFS_*_USER settings below run the HDFS daemons as root, which is convenient for a test setup but not recommended in production):

export HADOOP_HOME=/opt/hadoop
export PATH=$HADOOP_HOME/bin:$PATH
export PATH=$HADOOP_HOME/sbin:$PATH

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
  3. Configure HDFS

Edit core-site.xml and hdfs-site.xml to configure HDFS, and mapred-site.xml and yarn-site.xml to configure MapReduce and YARN:

/opt/hadoop/etc/hadoop/core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://hadoop:9000</value>
        </property>

        <property>
            <name>hadoop.tmp.dir</name>
            <value>/opt/hadoop/data/tmp</value>
        </property>
</configuration>

/opt/hadoop/etc/hadoop/hdfs-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
                <description>Number of block replicas; the default is 3, and it should not exceed the number of DataNodes</description>
        </property>
        <property>
                <name>dfs.http.address</name>
                <value>0.0.0.0:50070</value>
                <description>Bind the NameNode web UI to 0.0.0.0 instead of the loopback address so that port 50070 is reachable from outside the machine (in Hadoop 3.x this key is deprecated in favor of dfs.namenode.http-address, whose default port is 9870)</description>
        </property>
</configuration>

/opt/hadoop/etc/hadoop/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
</configuration>

/opt/hadoop/etc/hadoop/yarn-site.xml

<?xml version="1.0"?>
<configuration>

<!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
</configuration>

Modify the hostname so that it matches the host used in fs.defaultFS (hdfs://hadoop:9000):

sudo vim /etc/hostname

hadoop

sudo vim /etc/hosts
127.0.0.1 hadoop

Next, format HDFS by running the following command:

hdfs namenode -format
  4. Start Hadoop

Start the Hadoop NameNode and DataNode:

start-dfs.sh
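
With HDFS running, Spark can read and write data through hdfs:// URLs directly. A minimal sketch (the path is illustrative and assumes the fs.defaultFS value hdfs://hadoop:9000 configured above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Write a small DataFrame to HDFS, then read it back.
df = spark.createDataFrame([(1, "spark"), (2, "hadoop")], ["id", "name"])
df.write.mode("overwrite").parquet("hdfs://hadoop:9000/tmp/demo")

spark.read.parquet("hdfs://hadoop:9000/tmp/demo").show()

spark.stop()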

7. Summary

In this post we walked through how to install and configure Apache Spark on different operating systems (Linux, Windows, macOS, Ubuntu). Whether in local mode or cluster mode, Spark provides efficient support for large-scale data processing. For cluster mode, we also covered how to configure the Master and Worker nodes and start a Spark cluster, and, where HDFS is involved, how to install and configure Hadoop.

With these steps you can deploy Spark on different platforms and tune its configuration to your needs, supporting the development and analysis of big data applications.