Spark RDDs with JSON and Python

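I am new to Spark. I am trying to make Spark understand a JSON input, but I am not managing to. To summarize, I am using Spark's ALS algorithm to produce recommendations. When I provide a CSV file as input, everything works fine. However, my input is actually JSON, as follows: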

all_user_recipe_rating = [{'rating': 1, 'recipe_id': 8798, 'user_id': 2108}, {'rating': 4, 'recipe_id': 6985, 'user_id': 4236}, {'rating': 4, 'recipe_id': 13572, 'user_id': 2743}, {'rating': 4, 'recipe_id': 6312, 'user_id': 3156}, {'rating': 1, 'recipe_id': 12836, 'user_id': 768}, {'rating': 1, 'recipe_id': 9237, 'user_id': 1599}, {'rating': 2, 'recipe_id': 16946, 'user_id': 2687}, {'rating': 2, 'recipe_id': 20728, 'user_id': 58}, {'rating': 4, 'recipe_id': 12921, 'user_id': 2221}, {'rating': 2, 'recipe_id': 10693, 'user_id': 2114}, {'rating': 2, 'recipe_id': 18301, 'user_id': 4898}, {'rating': 2, 'recipe_id': 9967, 'user_id': 3010}, {'rating': 2, 'recipe_id': 16393, 'user_id': 4830}, {'rating': 4, 'recipe_id': 14838, 'user_id': 583}] 

ratings_RDD = self.spark.parallelize(all_user_recipe_rating) 

ratings = ratings_RDD.map(lambda row: 
    (Rating(int(row['user_id']), 
    int(row['recipe_id']), 
    float(row['rating'])))) 

model = self.build_model(ratings)

This is what I came up with after looking at some examples, but this is what I get:

MatrixFactorizationModel: User factor is not cached. Prediction could be slow. 
16/12/21 03:54:53 WARN MatrixFactorizationModel: Product factor does not have a partitioner. Prediction on individual records could be slow. 
16/12/21 03:54:53 WARN MatrixFactorizationModel: Product factor is not cached. Prediction could be slow. 
16/12/21 03:54:53 WARN MatrixFactorizationModelWrapper: User factor does not have a partitioner. Prediction on individual records could be slow.

and

File "/usr/local/spark/python/pyspark/mllib/recommendation.py", line 147, in <lambda> 
user_product = user_product.map(lambda u_p: (int(u_p[0]), int(u_p[1]))) 
TypeError: int() argument must be a string or a number, not 'Rating'

Could anyone give me a hand? :) Thanks!

Source

2016-12-21 Larissa Leite

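Well,

your error has a single cause.

This exception is raised by predictAll, a method of the model that ALS returns; it takes an RDD<int, int> of (user, product) pairs. I took your code and built what you need: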

>>> from pyspark.mllib.recommendation import ALS, Rating 
>>> ratings = ratings_RDD.map(lambda row: 
... (Rating(int(row['user_id']), 
... int(row['recipe_id']), 
... float(row['rating'])))) 
>>> model = ALS.trainImplicit(ratings, 1, seed=10) 
>>> to_predict = spark.parallelize([[2108, 16393], [583, 20728]]) 
>>> model.predictAll(to_predict).take(2) 
[Rating(user=583, product=20728, rating=0.0741161997082127), Rating(user=2108, product=16393, rating=0.05669039815320609)]

Your JSON is fine.

If I'm not mistaken, the problem is that you are sending Rating objects to predictAll instead of an RDD<int, int>.
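
A minimal sketch of that fix, not taken from the original thread: the helper names to_rating and to_user_product are hypothetical, and the namedtuple is a local stand-in that mirrors the fields of pyspark.mllib.recommendation.Rating, so the snippet runs without a Spark installation. In a real job the two helpers would be passed to RDD.map, as the closing comments suggest.

```python
from collections import namedtuple

# Local stand-in mirroring pyspark.mllib.recommendation.Rating
# (fields: user, product, rating). With Spark available, import the real class.
Rating = namedtuple("Rating", ["user", "product", "rating"])


def to_rating(row):
    """Turn one JSON record (a dict) into a Rating for training."""
    return Rating(int(row["user_id"]), int(row["recipe_id"]), float(row["rating"]))


def to_user_product(rating):
    """Drop the score and keep the (user, product) pair that predictAll expects."""
    return (int(rating.user), int(rating.product))


rows = [
    {"rating": 1, "recipe_id": 8798, "user_id": 2108},
    {"rating": 4, "recipe_id": 14838, "user_id": 583},
]
ratings = [to_rating(row) for row in rows]
pairs = [to_user_product(r) for r in ratings]
print(pairs)  # [(2108, 8798), (583, 14838)]

# Assumed Spark usage, with sc a SparkContext and model a trained
# MatrixFactorizationModel:
#   ratings_RDD = sc.parallelize(rows).map(to_rating)
#   predictions = model.predictAll(ratings_RDD.map(to_user_product))
```

Keeping the two conversions separate means the same records can feed training (as Rating objects) and prediction (as plain pairs) without being mixed up.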

Source

2016-12-21 19:12:41
