I created a Spark container using the following Dockerfile:
```
FROM ubuntu:16.04

RUN apt-get update -y && apt-get install -y \
    default-jdk \
    nano \
    wget && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

RUN useradd --create-home --shell /bin/bash ubuntu
ENV HOME /home/ubuntu

ENV SPARK_VERSION 2.4.3
ENV HADOOP_VERSION 2.6
ENV MONGO_SPARK_VERSION 2.2.0
ENV SCALA_VERSION 2.11

WORKDIR ${HOME}

ENV SPARK_HOME ${HOME}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}
ENV PATH ${PATH}:${SPARK_HOME}/bin

COPY files/times.json /home/ubuntu/times.json
COPY files/README.md /home/ubuntu/README.md
COPY files/examples.scala /home/ubuntu/examples.scala
COPY files/initDocuments.scala /home/ubuntu/initDocuments.scala

RUN chown -R ubuntu:ubuntu /home/ubuntu/*

USER ubuntu

# get spark
RUN wget http://apache.mirror.digitalpacific.com.au/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz && \
    tar xvf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz

RUN rm -fv spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
```
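For context, I build the image and open a shell in it roughly like this (the image tag `spark-mongo` is just an illustrative name, not part of the setup above):

```
# build the image from the directory containing the Dockerfile
docker build -t spark-mongo .

# start an interactive container as the ubuntu user created in the Dockerfile
docker run -it spark-mongo /bin/bash
```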
I also have two files written in the Scala programming language, which is new to me. The problem is that the container only has Java installed, and no other tooling. Is there any way to run Scala on the container without installing additional programs?
The files are named examples.scala and initDocuments.scala. Here is the initDocuments.scala file:
```
import com.mongodb.spark._
import com.mongodb.spark.config._
import org.bson.Document

// load the collection configured via spark.mongodb.input.uri
val rdd = MongoSpark.load(sc)
if (rdd.count < 1) {
  // parse each line of times.json into a BSON Document and save it to MongoDB
  val t = sc.textFile("times.json")
  val converted = t.map((tuple) => Document.parse(tuple))
  converted.saveToMongoDB(WriteConfig(Map("uri" -> "mongodb://mongodb/spark.times")))
  println("Documents inserted.")
} else {
  println("Database 'spark' collection 'times' is not empty. Maybe you've loaded data into the collection previously? Skipping.")
}
System.exit(0)
```
I also tried the following, but it does not work.
```
spark-shell --conf "spark.mongodb.input.uri=mongodb://mongodb:27017/spark.times" \
            --conf "spark.mongodb.output.uri=mongodb://mongodb/spark.output" \
            --packages org.mongodb.spark:mongo-spark-connector_${SCALA_VERSION}:${MONGO_SPARK_VERSION} \
            -i ./initDocuments.scala
```
Error:
```
Ivy Default Cache set to: /home/ubuntu/.ivy2/cache
The jars for the packages stored in: /home/ubuntu/.ivy2/jars
:: loading settings :: url = jar:file:/home/ubuntu/spark-2.4.3-bin-hadoop2.6/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.mongodb.spark#mongo-spark-connector_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-d0f95242-e9b9-4d49-8dde-42afc7c55e9a;1.0
        confs: [default]
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
:: resolution report :: resolve 40879ms :: artifacts dl 0ms
        :: modules in use:
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   1   |   0   |   0   |   0   ||   0   |   0   |
        ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
        Host repo1.maven.org not found. url=https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.pom
        Host repo1.maven.org not found. url=https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.jar
        Host dl.bintray.com not found. url=https://dl.bintray.com/spark-packages/maven/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.pom
        Host dl.bintray.com not found. url=https://dl.bintray.com/spark-packages/maven/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.jar

        module not found: org.mongodb.spark#mongo-spark-connector_2.11;2.2.0

        ==== local-m2-cache: tried
          file:/home/ubuntu/.m2/repository/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.pom
          -- artifact org.mongodb.spark#mongo-spark-connector_2.11;2.2.0!mongo-spark-connector_2.11.jar:
          file:/home/ubuntu/.m2/repository/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.jar

        ==== local-ivy-cache: tried
          /home/ubuntu/.ivy2/local/org.mongodb.spark/mongo-spark-connector_2.11/2.2.0/ivys/ivy.xml
          -- artifact org.mongodb.spark#mongo-spark-connector_2.11;2.2.0!mongo-spark-connector_2.11.jar:
          /home/ubuntu/.ivy2/local/org.mongodb.spark/mongo-spark-connector_2.11/2.2.0/jars/mongo-spark-connector_2.11.jar

        ==== central: tried
          https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.pom
          -- artifact org.mongodb.spark#mongo-spark-connector_2.11;2.2.0!mongo-spark-connector_2.11.jar:
          https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.jar

        ==== spark-packages: tried
          https://dl.bintray.com/spark-packages/maven/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.pom
          -- artifact org.mongodb.spark#mongo-spark-connector_2.11;2.2.0!mongo-spark-connector_2.11.jar:
          https://dl.bintray.com/spark-packages/maven/org/mongodb/spark/mongo-spark-connector_2.11/2.2.0/mongo-spark-connector_2.11-2.2.0.jar

        ::::::::::::::::::::::::::::::::::::::::::::::
        ::          UNRESOLVED DEPENDENCIES         ::
        ::::::::::::::::::::::::::::::::::::::::::::::
        :: org.mongodb.spark#mongo-spark-connector_2.11;2.2.0: not found
        ::::::::::::::::::::::::::::::::::::::::::::::

:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.mongodb.spark#mongo-spark-connector_2.11;2.2.0: not found]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1306)
        at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:315)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:143)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
PS: I tried the following command to change the proxy address, but I don't think I have a working proxy for my case. I would be grateful if someone could help me set up a well-configured proxy to solve my download problem.
```
export JAVA_OPTS="$JAVA_OPTS -Dhttp.proxyHost=yourserver -Dhttp.proxyPort=8080 -Dhttp.proxyUser=username -Dhttp.proxyPassword=password"
```
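One detail worth noting: the Spark launch scripts do not read JAVA_OPTS; they pass extra JVM options to spark-submit/spark-shell via the SPARK_SUBMIT_OPTS environment variable. A hedged sketch of that variant, assuming a proxy really is required on this network (the host and port below are placeholders, not real values):

```
# placeholder proxy values - replace with a proxy that actually exists on your network
export SPARK_SUBMIT_OPTS="-Dhttp.proxyHost=yourserver -Dhttp.proxyPort=8080 \
  -Dhttps.proxyHost=yourserver -Dhttps.proxyPort=8080"

spark-shell --packages org.mongodb.spark:mongo-spark-connector_${SCALA_VERSION}:${MONGO_SPARK_VERSION} \
            -i ./initDocuments.scala
```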
Based on the error message below:

```
:: org.mongodb.spark#mongo-spark-connector_2.11;2.2.0: not found
```
it indicates that the package is missing. Checking the currently available MongoDB Connector for Spark packages confirms that this package is no longer available (it has been superseded by the v2.2.6 patch release).
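Pointing --packages at the 2.2.6 artifact instead should let Ivy resolve it. A minimal sketch of the adjusted invocation, keeping everything else from your original command:

```
spark-shell --conf "spark.mongodb.input.uri=mongodb://mongodb:27017/spark.times" \
            --conf "spark.mongodb.output.uri=mongodb://mongodb/spark.output" \
            --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.6 \
            -i ./initDocuments.scala
```

If you want to keep driving the version from the Dockerfile, the equivalent change there is `ENV MONGO_SPARK_VERSION 2.2.6`.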
You can see an updated example of the MongoDB Spark connector with Docker at sindbach/mongodb-spark-docker.
Additional information: spark-shell is a REPL (Read-Evaluate-Print Loop) tool. It is an interactive shell that programmers use to interact with a framework. You do not need to explicitly run a build step to execute anything; when you pass the --packages argument to spark-shell, it fetches the package automatically and includes it in your shell environment, as the snippet below illustrates.
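A quick illustration of that behavior (the import line is just a sanity check, not a required step):

```
# the connector jar is fetched by Ivy at startup; no build step is involved
spark-shell --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.6

scala> import com.mongodb.spark._   // succeeds once the package is on the classpath
```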